Thursday 11 May 2017

Are You Talkin' To ME? In defense of mixed-effects models


At NWAV 43 in Chicago, Joseph Roy and Stephen Levey presented a poster calling for "caution" in the use of mixed-effects models in situations where the data is highly unbalanced, especially if some of the random-effects groups (speakers) have only a small number of observations (tokens).

One of their findings involved a model where a certain factor received a very high factor weight, like .999, which pushed the other factor weights in the group well below .500. Although I have been unable to look at their data, and so can't determine what caused this to happen, it reminded me that sum-contrast coefficients or factor weights can only be interpreted relative to the other ones in the same group.

An outlying coefficient C drags down the individual values of A and B (sum-contrast coefficients in a group are centered around zero), but it does not affect the difference between A and B. This is much easier to see when the coefficients are expressed in log-odds units. In log-odds, it seems obvious that the difference between A: -1 and B: +1 is the same as the difference between A': -5 and B': -3. The difference in each case is 2 log-odds.

Expressed as factor weights -- A: .269, B: .731; A': .007, B': .047 -- this equivalence is obscured, to say the least. It is impossible to consistently describe the difference between any two factor weights, if there are three or more in the group. To put it mildly, this is one of the disadvantages of using factor weights for reporting the results of logistic regressions.
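
The conversion is just the inverse logit, so the arithmetic is easy to check in R with plogis():

    # The same 2-log-odds difference, at two locations on the scale
    plogis(c(-1, 1))     # A  = .269, B  = .731  (factor-weight gap of .46)
    plogis(c(-5, -3))    # A' = .007, B' = .047  (factor-weight gap of .04)

    # The underlying odds ratio is identical in both cases
    exp(1 - (-1)); exp(-3 - (-5))    # both 7.39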

Since factor weights (and the Varbrul program that produces them) have several other drawbacks, I am more interested in the (software-independent) question that Roy & Levey raise, about fitting mixed-effects models to unbalanced data. Even though handling unbalanced data is one of the main selling points of mixed models (Pinheiro & Bates 2000), Roy and Levey claim that such analyses "with less than 30-50 tokens per speaker, with at least 30-50 speakers, vastly overestimate variance", citing Moineddin et al. (2007).

However, Moineddin et al. actually only report finding such an overestimate "when group size is small (e.g. 5)". In any case, the focus on group size points to the possibility that small numbers of tokens for some speakers, rather than the data imbalance itself, are the real issue.

Fixed-effects models like Varbrul's vastly underestimate speaker variance by not estimating it at all and assuming it to be zero. Therefore, they inflate the significance of between-speaker (social) factors. P-values associated with these factors are too low, increasing the rate of Type I error beyond the nominal 5% (this behavior is called "anti-conservative"). All things being equal, the more tokens there are per speaker, the worse the performance of a fixed-effects model will be (Johnson 2009).

With only 20 tokens per speaker, the advantage of the mixed-effects model can be small, but there is no sign that mixed models ever err in the opposite direction, by overestimating speaker variance -- at least, not in the balanced, simulated data sets of Johnson (2009). If they did, they would show p-values that are higher than they should be, resulting in Type I error rates below 5% (this behavior is called "conservative").

It is difficult to compare the performance of statistical models on real data samples (as Roy and Levey do for three Canadian English variables), because the true population parameters are never known. Simulations are a much better way to assess the consequences of a claim like this.

I simulated data from 20 "speakers" in two groups -- 10 "male", 10 "female" -- with a population gender effect of zero, and speaker effects normally distributed with a standard deviation of either zero (no individual-speaker effects), 0.1 log-odds (95% of speakers with input probabilities between .451 and .549), or 0.2 log-odds (95% of speakers between .403 and .597).
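
These intervals are just the 95% normal range on the log-odds scale, pushed through the inverse logit; in R:

    plogis(qnorm(c(.025, .975), mean = 0, sd = 0.1))    # 0.451 0.549
    plogis(qnorm(c(.025, .975), mean = 0, sd = 0.2))    # 0.403 0.597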

The average number of tokens per speaker (N_s) ranged from 5 to 100. The number of tokens per speaker was either balanced (all speakers have N_s tokens), imbalanced (N_s * rnorm(20, 1, 0.5)), or very imbalanced (N_s * rnorm(20, 1, 1)). Each speaker had at least one token, and no speaker had more than three times the average number of tokens.
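
The full simulation code isn't reproduced here, but a minimal sketch of the data-generating step, following the description above (the function simulate_data() and its argument names are mine, not from the original code), would look something like this:

    # One simulated dataset: 20 speakers (10 "male", 10 "female"), no true gender
    # effect, and speaker intercepts drawn from N(0, speaker_sd) on the log-odds scale.
    simulate_data <- function(n_per_speaker = 50, speaker_sd = 0.1, imbalance = 0) {
      n_speakers <- 20
      gender <- rep(c("male", "female"), each = n_speakers / 2)
      speaker_effect <- rnorm(n_speakers, 0, speaker_sd)
      # Token counts: balanced (imbalance = 0), imbalanced (0.5), or very imbalanced (1),
      # truncated so every speaker has between 1 and 3 * n_per_speaker tokens
      n_tokens <- round(n_per_speaker * rnorm(n_speakers, 1, imbalance))
      n_tokens <- pmin(pmax(n_tokens, 1), 3 * n_per_speaker)
      data.frame(
        speaker = factor(rep(seq_len(n_speakers), times = n_tokens)),
        gender  = factor(rep(gender, times = n_tokens)),
        y       = rbinom(sum(n_tokens), 1, plogis(rep(speaker_effect, times = n_tokens)))
      )
    }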

For each of these settings, 1000 datasets were generated and two models were fit to each dataset: a fixed-effects model with a predictor for gender (equivalent to the "cautious" Varbrul model that Roy & Levey implicitly recommend), and a mixed-effects (glmer) model with a predictor for gender and a random intercept for speaker. In each case, the drop1 function (a likelihood-ratio test) provided the p-value for gender, and the Type I error rate was calculated as the proportion of the 1000 models with p < .05. Because there is no real gender effect, if everything is working properly, this rate should always be 5%.
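
A sketch of the corresponding model-fitting and testing step, assuming the simulate_data() helper above (glm, drop1, and lme4's glmer are real functions; the object names are mine):

    library(lme4)

    d <- simulate_data(n_per_speaker = 50, speaker_sd = 0.1, imbalance = 0)

    # Fixed-effects model: gender only, with no speaker term (the Varbrul-style analysis)
    m_fixed <- glm(y ~ gender, family = binomial, data = d)

    # Mixed-effects model: gender plus a random intercept for speaker
    m_mixed <- glmer(y ~ gender + (1 | speaker), family = binomial, data = d)

    # Likelihood-ratio tests for dropping gender; over 1000 simulated datasets,
    # the Type I error rate is the proportion of these p-values that fall below .05
    drop1(m_fixed, test = "Chisq")
    drop1(m_mixed, test = "Chisq")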

[Figure: Type I error rate (proportion of runs with p < .05 for gender) by average number of tokens per speaker, for the fixed-effects model (blue) and the mixed-effects model (magenta); columns show increasing speaker variance, rows show increasing data imbalance.]

The figure above plots, for each panel, the proportion of significant p-values (p < .05) obtained from the simulation: blue for the fixed-effects model, magenta for the mixed model, with a loess smoothing line added to each panel. Again, since the true population gender difference is always zero, any result deemed significant is a Type I error. The figure shows that:

1) If there is no individual-speaker variation (left column), the fixed-effects model appears to behave properly, with 5% Type I error, and the mixed model is slightly conservative, with 4% Type I error. There is no effect of the average number of tokens per speaker (within each panel), nor is there any effect of data imbalance (between the rows of the figure).

2) If there is individual-speaker variation (center and right columns), the fixed-effects model error rate is always above 5%, and it increases roughly linearly with the number of tokens per speaker. The greater the individual-speaker variation, the faster the Type I error rate of the fixed-effects model rises, and therefore the larger its disadvantage compared with the mixed model.

The mixed model proportions are much closer to 5%. We do see a small increase in Type I error as the number of tokens per speaker increases; the mixed model goes from being slightly conservative (p-values too high, Type I error below 5%) to slightly anti-conservative (p-values too low, Type I error above 5%).

Finally, there is a small increase in Type I error associated with greater data imbalance across groups. However, this effect can be seen for both types of models. There is no evidence that mixed models are more susceptible to error from this source, either with a low or a high number of average tokens per speaker.

In summary, the simulation does not show any sign of markedly overconservative behavior from the mixed models, even when the number of tokens per speaker is low and the degree of imbalance is high. This is likely because the mixed model is not "vastly overestimating" speaker variance in any general way, despite Roy & Levey's warnings to the contrary.

We can look at what is going on with these estimates of speaker variance, starting with a "middle-of-the-road" case where the average number of tokens per speaker is 50, the true individual-speaker standard deviation is 0.1, and there is no imbalance across groups.

For this balanced case, the fixed-effects model gives an overall Type I error rate of 6.4%, while the mixed model gives 4.4%. The mean estimate of the individual-speaker standard deviation, in the mixed model, is 0.063. Note that this average is an underestimate, not an overestimate, of the true population value of 0.1.

Indeed, in 214 of the 1000 runs, the mixed model underestimated the speaker variation as much as it possibly could: the estimate came out as zero. For these runs, the Type I error rate was higher, 6.1%, similar to that of the fixed-effects model, as we would expect.

In 475 of the runs, the estimated speaker standard deviation was positive but still below the true value of 0.1, and the Type I error rate was 5.3%. And in 311 runs, it was indeed overestimated, that is, higher than 0.1. The Type I error rate for these runs was only 1.9%.
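
The estimated speaker standard deviation behind this three-way split can be read off each fitted model; a sketch, continuing with the illustrative names used above:

    # Estimated standard deviation of the speaker random intercept
    sd_hat <- attr(VarCorr(m_mixed)$speaker, "stddev")

    # Classify the run relative to the true value of 0.1
    if (sd_hat == 0) {
      "boundary estimate (zero)"
    } else if (sd_hat < 0.1) {
      "underestimate"
    } else {
      "overestimate"
    }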

Mixed models can overestimate speaker variance -- incidentally, this is because of the sample data they are given, not because of some glitch -- and when this happens, the p-value for a between-speaker effect will be too high (conservative), compared to what we would calculate if the true variance in the population were known. However, the opposite happens at least as often: the speaker variance is underestimated, resulting in p-values that are too low (anti-conservative). On average, though, the mixed-effects model does not behave in an overly conservative way.

If we make the same data quite unbalanced across speakers (keeping the average of 50 tokens per speaker and the speaker standard deviation of 0.1), the Type I error rates rise to 8.3% for the fixed-effects model and 5.6% for the mixed model. So data imbalance does inflate Type I error, but mixed models still maintain a consistent advantage. And it is still more common for the mixed model to estimate zero speaker variance (35% of runs) than to overestimate the true value (28% of runs).

I speculated above that small groups -- speakers with few tokens -- might pose more of a problem than unbalanced data itself. Keeping the population speaker variance of 0.1, and the high level of data imbalance, but considering the case with only 10 tokens per speaker on average, we see that the Type I error rates are 4.5% for fixed, 3.0% for mixed.

The 4.5% figure would probably average out close to 5% over more simulation runs; it's within the range of error exhibited by the points in the figure above (top row, middle column). Recall that our simulations go as low as 5 tokens per speaker, and if there were only 1 token per speaker, no one would fault a fixed-effects model for ignoring individual-speaker variation (or, put another way, within-speaker correlation). But sociolinguistic studies with only a handful of observations per speaker or text are not that common, outside of New York department stores, rare discourse variables, and historical syntax.

For the mixed model, the Type I error rate is the lowest we have seen, even though only 28% of runs overestimated the speaker variance. Many of these overestimated it considerably, however, contributing to the overall conservative behavior.

Perhaps this is all that Roy & Levey intended by their admonition to use caution with mixed models. But a better target of caution might be any data set like this one: a binary linguistic variable, collected from 10 "men" and 10 "women", where two people contributed one token each, another contributed 2, another 4, etc., while others contributed 29 or 30 tokens. As much as we love "naturalistic" data, it is not hard to see that such a data set is far from ideal for answering the question of whether men or women use a linguistic variable more often. If we have to start with very unbalanced data sets, including groups with too few observations to reasonably generalize from, it is too much to expect that any one statistical procedure can always save us.

The simulations used here are idealized -- for one thing, they assume normal distributions of speaker effects -- but they are replicable, and can be tweaked and improved in any number of ways. Simulations are not meant to replicate all the complexities of "real data", but rather to allow the manipulation of known properties of the data. When comparing the performance of two models, it really helps to know the actual properties of what is being modeled. Attempting to use real data to compare the performance of models at best confuses sample and population, and at worst casts unwarranted doubt on reliable tools.


References

Johnson, D. E. 2009. Getting off the GoldVarb standard: introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass 3/1: 359-383.

Moineddin, R., F. I. Matheson and R. H. Glazier. 2007. A simulation study of sample size for multilevel logistic regression models. BMC Medical Research Methodology 7(1): 34.

Pinheiro, J. C. and D. M. Bates. 2000. Mixed-effects models in S and S-PLUS. New York: Springer.

Roy, J. and S. Levey. 2014. Mixed-effects models and unbalanced sociolinguistic data: the need for caution. Poster presented at NWAV 43, Chicago. http://www.nwav43.illinois.edu/program/documents/Roy-LongAbstract.pdf
