Thursday 11 May 2017

If Individuals Follow The Exponential Hypothesis, Groups Don't (And Vice Versa)


If you haven't heard of the Exponential Hypothesis, read this, this and this, and if you want more, look here. Guy's paper inspired me to want to do this kind of linguistics. But now it seems that the patterns he so cleverly explained were just meaningless coincidences - leaving this, uncontested, as the most impressive quantitative LVC paper of all time. But I digress.

Sociolinguists have differed for aeons regarding the relationship between the individual and the group. Even Labov's clear statements along the lines that "language is not a property of the individual, but of the community" are qualified, or undermined, by defining a speech community as "a group of people who share a given set of norms of language" (see also pp. 206-210 of the same paper for a staunch defense of the study of individuals).

The typical variationist recognizes the practical need to combine data from a group of speakers, even if their theoretical goal is the analysis of individual grammars. After some years spent in ignorance of the statistical ramifications of this situation, they have now generally adopted mixed-effects regression modeling as a way to have their cake and eat it too.

But the Exponential Hypothesis is not well-equipped to bridge this gap. If each individual i retains final t/d at a rate of r_i for regular past tense forms, r_i^2 for weak past tense forms, and r_i^3 for monomorphemes - and if r_i varies by individual (as has always been conceded) - then the pooled data from all speakers can never show an exponential relationship.
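To make the claim concrete, it may help to write out what the pooled data would have to satisfy. The sketch below assumes equal amounts of data per speaker, so that pooled rates are simple means over speakers. The symbols \bar{r}, E[\cdot] and Var are notation introduced here for illustration, and the `align*` environment assumes the amsmath package.

```latex
\begin{align*}
\text{pooled regular past:} \quad & E[r_i] = \bar{r} \\
\text{pooled weak past:} \quad & E[r_i^2] \\
\text{pooled monomorpheme:} \quad & E[r_i^3] \\[4pt]
\text{an exponential pattern requires:} \quad & E[r_i^2] = \bar{r}^2, \qquad E[r_i^3] = \bar{r}^3 \\[4pt]
\text{but:} \quad & E[r_i^2] - \bar{r}^2 = \operatorname{Var}(r_i) > 0 \quad \text{whenever speakers differ}
\end{align*}
```

So for the weak past class, the pooled rate overshoots the exponential prediction by exactly the between-speaker variance in r_i.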

I will demonstrate this under four assumptions about how speakers might vary: 1) the probability of retention, r, is normally distributed across the population; 2) the probability of retention is uniformly (evenly) distributed over a similar range; 3) the log-odds of retention - log(r / (1 - r)) - is normally distributed; 4) the log-odds of retention is uniformly distributed.

Using a central value for r of +2 log-odds (.881), and allowing speakers to vary with a standard deviation of 1 (in log-odds) or 0.15 (in probability), I obtained the following results, with 100,000 speakers in each simulation:

| Probability Normal | Theoretical (Exponential) | Empirical (Group Mean) |
|---|---|---|
| Regular Past | .862 | .862 |
| Weak Past | .743 | .759 |
| Monomorpheme | .641 | .679 |


| Probability Uniform | Theoretical (Exponential) | Empirical (Group Mean) |
|---|---|---|
| Regular Past | .862 | .862 |
| Weak Past | .742 | .758 |
| Monomorpheme | .639 | .680 |


| Log-Odds Normal | Theoretical (Exponential) | Empirical (Group Mean) |
|---|---|---|
| Regular Past | .844 | .844 |
| Weak Past | .712 | .728 |
| Monomorpheme | .601 | .638 |


| Log-Odds Uniform | Theoretical (Exponential) | Empirical (Group Mean) |
|---|---|---|
| Regular Past | .842 | .842 |
| Weak Past | .710 | .724 |
| Monomorpheme | .598 | .633 |

These simulations assume an equal amount of data from each speaker, and an equal balance of words from each speaker (which matters if individual words vary). If these conditions are not met, as in real data, the groups will likely deviate even more from the exponential pattern. Looking at it the other way round, the very existence of an exponential pattern in pooled data - as is found for t/d-deletion in English! - is evidence that the true Exponential Hypothesis, for individual grammars, is false.
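For readers who want to try this themselves, here is a minimal sketch of how such a simulation might be set up, using only the Python standard library. It is not the code behind the tables. Several details are my assumptions rather than anything stated in the post: probabilities drawn outside [0, 1] are clipped, the uniform distributions are given the same standard deviation as the normal ones (half-width = sd × √3), and the function and variable names are made up for illustration.

```python
import math
import random


def logistic(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def clip01(p):
    """Assumption: out-of-range probabilities are clipped to [0, 1]."""
    return min(1.0, max(0.0, p))


def simulate(draw_rate, n_speakers=100_000, seed=1):
    """Draw one retention rate per speaker; compare mean**n with mean of r**n."""
    rng = random.Random(seed)
    rates = [draw_rate(rng) for _ in range(n_speakers)]
    mean_rate = sum(rates) / n_speakers
    rows = []
    for label, power in (("Regular Past", 1), ("Weak Past", 2), ("Monomorpheme", 3)):
        theoretical = mean_rate ** power                       # exponential prediction
        empirical = sum(r ** power for r in rates) / n_speakers  # pooled group mean
        rows.append((label, theoretical, empirical))
    return rows


CENTER_LOGODDS = 2.0
CENTER_PROB = logistic(CENTER_LOGODDS)  # roughly .881
SD_PROB, SD_LOGODDS = 0.15, 1.0
ROOT3 = math.sqrt(3)  # a uniform on [c - h, c + h] has sd h / sqrt(3)

scenarios = {
    "Probability Normal": lambda rng: clip01(rng.gauss(CENTER_PROB, SD_PROB)),
    "Probability Uniform": lambda rng: clip01(
        rng.uniform(CENTER_PROB - ROOT3 * SD_PROB, CENTER_PROB + ROOT3 * SD_PROB)
    ),
    "Log-Odds Normal": lambda rng: logistic(rng.gauss(CENTER_LOGODDS, SD_LOGODDS)),
    "Log-Odds Uniform": lambda rng: logistic(
        rng.uniform(CENTER_LOGODDS - ROOT3 * SD_LOGODDS, CENTER_LOGODDS + ROOT3 * SD_LOGODDS)
    ),
}

results = {name: simulate(draw) for name, draw in scenarios.items()}

for name, rows in results.items():
    print(name)
    for label, theoretical, empirical in rows:
        print(f"  {label:<13} {theoretical:.3f}  {empirical:.3f}")
```

Changing `seed` or `n_speakers` will move the third decimal place around, but the empirical column should stay above the theoretical one for the weak past and monomorpheme rows.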

P.S. Why should this be, you ask? Let me try some math.

A function f(x) is strictly convex over an interval if the second derivative of the function is positive for all x in that interval.

Now let f(x) = x^n, where n > 1. The second derivative is n · (n-1) · x^{n-2}. Since n > 1, both n and (n-1) are positive. If x is positive, x^{n-2} is positive, making the second derivative positive, which means that x^n is strictly convex over the whole interval 0 < x < ∞.

Jensen's inequality states that if x is a random variable that is not constant and f(x) is a strictly convex function, then f(E[x]) < E[f(x)]. That is, if we take the expected value of a variable over an interval, and then apply a strictly convex function to it, the result is always less than if we apply the function first, and then take the expected value of the outcome.

In our case, x is the probability of t/d retention, and like all probabilities, it lies on the interval between 0 and 1, where we know x^n is strictly convex. By Jensen's inequality, E[x]^n < E[x^n]. This means that if we take the mean rate of retention for a group of speakers, and raise it to some power, the result is always less than if we raise each speaker's rate to that power, and then take the mean.
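A tiny worked example may make the inequality easier to see. The two retention rates below are hypothetical, chosen only so the arithmetic is easy to follow, and the variable names are mine.

```python
# Two hypothetical speakers with different retention rates for regular past forms.
rates = [0.80, 0.96]
mean_rate = sum(rates) / len(rates)

gaps = {}
for n in (2, 3):  # n = 2: weak past, n = 3: monomorphemes
    power_of_mean = mean_rate ** n                        # E[x]^n, the exponential prediction
    mean_of_powers = sum(r ** n for r in rates) / len(rates)  # E[x^n], the pooled group rate
    gaps[n] = mean_of_powers - power_of_mean
    print(f"n={n}: E[x]^n = {power_of_mean:.4f}, E[x^n] = {mean_of_powers:.4f}")
```

With these two speakers the pooled rate exceeds the exponential prediction for both n = 2 and n = 3; making the two rates equal is the only way to close the gap.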

Therefore, the theoretical exponential rate will always be less than the empirical group mean rate, which is what we observed in all the simulations above.
