Thursday, 11 May 2017

Quantifying Overlap: A Shiny App for NWAV 44 (Toronto, 2015)

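Pillai (actually 1 - Pillai):
# F2, F1: numeric formant vectors; vowel.class: the word class of each token
df <- data.frame(x = F2, y = F1, class = vowel.class)
m <- lm(cbind(x, y) ~ class, data = df)  # multivariate linear model
pillai <- 1 - anova(m)["class", "Pillai"]  # anova() on an mlm reports Pillai by default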

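Bhattacharyya affinity:
library(sp)            # SpatialPointsDataFrame()
library(adehabitatHR)  # kerneloverlap()
df <- data.frame(x = F2, y = F1, class = vowel.class)
spdf <- SpatialPointsDataFrame(cbind(df$x, df$y), data.frame(class = df$class))
# kernel-based BA between the two classes, using the Epanechnikov kernel
ba <- kerneloverlap(spdf, method = "BA", kern = "epa")[1, 2]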

Today at NWAV in Toronto, I ran out of time, but got through most of this presentation. It is not a set of PowerPoint slides but a Shiny app, which contains an interactive Overlap Simulator and an ANAE Explorer for the low back vowels. I think that interactive apps like this can be very useful as part of presentations and even publications, as we move away from the model of the traditional paper journal. The R/Shiny code that makes up the app is here and here, but please note that this was my first time trying to write this kind of code! Included in the code are functions for the Pillai score, the Bhattacharyya affinity, and the "Closer Centroid Correct" measure discussed in the text.

To summarize my talk, the popular Pillai score -- as noted by Nycz & Hall-Lew -- is not really a measure of the overlap of two clouds of points, such as vowel tokens from two different word classes. Pillai (a parametric statistic making several assumptions about the data) is more similar to an R-squared measurement, asking what proportion of the total variability in the data is "explained" by the difference (in means) between the two categories. Even when two clouds are clearly non-overlapping, there is still residual variation in each cloud, which means that Pillai will not come out as 1. On the other hand, if the means of the two groups are equal, Pillai will always come out as 0, even if the clouds have different shapes and are not technically showing complete overlap. Finally, Pillai is sensitive to imbalance in token numbers between word classes. If one class has more data than the other, Pillai suggests that there is more overlap than if the number of tokens were equal.
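These three drawbacks can be reproduced with a few lines of base R. The following is only a sketch on simulated data: the helper name pillai_score, the cloud sizes, and the choice of unit-variance normal clouds are all assumptions made for illustration, not taken from the app.

```r
# Hypothetical helper: Pillai score for two classes of points in a plane.
pillai_score <- function(x, y, class) {
  df <- data.frame(x = x, y = y, class = factor(class))
  m <- lm(cbind(x, y) ~ class, data = df)
  anova(m)["class", "Pillai"]
}

set.seed(1)
n <- 200
cls <- rep(c("A", "B"), each = n)

# 1. Two unit-variance clouds whose means are 6 SDs apart: essentially no
#    overlap, yet the within-cloud variation keeps Pillai below 1.
x <- c(rnorm(n, 0), rnorm(n, 6))
y <- rnorm(2 * n)
p_separated <- pillai_score(x, y, cls)

# 2. Equal means but very different shapes (one cloud tall, one wide):
#    the clouds do not coincide, but Pillai is close to 0.
x2 <- c(rnorm(n, 0, 1), rnorm(n, 0, 4))
y2 <- c(rnorm(n, 0, 4), rnorm(n, 0, 1))
p_same_means <- pillai_score(x2, y2, cls)

# 3. Imbalance: the same clouds as in (1), but only 20 tokens of class B.
#    Pillai drops, so 1 - Pillai suggests more overlap than before.
keep <- c(1:n, n + 1:20)
p_imbalanced <- pillai_score(x[keep], y[keep], cls[keep])

print(c(separated = p_separated, same_means = p_same_means,
        imbalanced = p_imbalanced))
```

Changing the 6 in step 1, or the 20 in step 3, gives a feel for how quickly the score moves away from what the eye sees in a scatterplot.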

The Overlap Simulator allows the user to observe these drawbacks of the Pillai score, and to note that the Bhattacharyya affinity (or coefficient) generally does not suffer from the same problems (although it is also skewed, to a lesser degree, when the tokens are imbalanced across groups). BA was explicitly designed as a measure of the overlap of two continuous distributions, and has a very simple mathematical formula: multiply the class probabilities, take the square root, and integrate over the plane. For R to estimate and implement BA, though, a few parameters must be set: the type of kernel, the kernel bandwidth, and the grid size. I have mainly used the default values for these.
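The "multiply, take the square root, integrate" recipe can be checked numerically when the two densities are known exactly. The sketch below is not the kernel-based estimate that kerneloverlap produces from data; it assumes two bivariate normal distributions with independent, unit-variance axes, and the grid limits and step size are arbitrary choices. For two normals with the same covariance, BA has the closed form exp(-d^2 / 8), where d is the distance between the means in standard deviations, so the grid sum can be compared against it.

```r
# Known density: bivariate normal with independent unit-variance axes
# (an assumption made only to keep the example short).
dens <- function(gx, gy, mx, my) dnorm(gx, mx) * dnorm(gy, my)

# A regular grid over the part of the plane where the densities live.
step <- 0.05
g <- seq(-8, 10, by = step)
grid <- expand.grid(x = g, y = g)

# BA on the grid: multiply the two densities, take the square root,
# and integrate (here, sum over cells times the cell area).
ba_grid <- function(mx1, my1, mx2, my2) {
  p <- dens(grid$x, grid$y, mx1, my1)
  q <- dens(grid$x, grid$y, mx2, my2)
  sum(sqrt(p * q)) * step^2
}

ba_same <- ba_grid(0, 0, 0, 0)  # identical distributions
ba_2sd  <- ba_grid(0, 0, 2, 0)  # means 2 SDs apart
ba_6sd  <- ba_grid(0, 0, 6, 0)  # means 6 SDs apart

# Compare with the closed form exp(-d^2 / 8) for equal covariances.
print(rbind(grid   = c(ba_same, ba_2sd, ba_6sd),
            closed = exp(-c(0, 2, 6)^2 / 8)))
```

With real vowel tokens the densities are not known, which is where the kernel type, bandwidth, and grid size come in.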

Another measure of overlap, which I came up with (as far as I know), is the Closer Centroid Correct Criterion (CCCC). This seems to perform similarly to BA, although it tends to have a lower value (when converted to a scale where 0 means no overlap and 1 means complete overlap). One possible advantage of CCCC is that its calculation is very simple: it represents the chance that a point is closer to the centroid or mean of its own class rather than the other class. This seems like it could reflect the amount of confusion that a listener might have in distinguishing two vowel classes in the speech of another person, and also would presumably (?) be computationally/brain-instantiable much more readily than the Bhattacharyya method, which involves estimation, multiplication, and integration of two-dimensional probability distributions.
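A minimal sketch of this calculation follows. It is not the function from the app: the name cccc, the use of plain Euclidean distance in the raw two-dimensional space, the handling of ties, and especially the rescaling 2 * (1 - correct) to a 0-to-1 overlap scale are assumptions made here for illustration.

```r
# Hypothetical helper: share of points lying closer to their own class
# centroid than to the other class's centroid.
cccc <- function(x, y, class) {
  class <- factor(class)
  stopifnot(nlevels(class) == 2)
  in1 <- class == levels(class)[1]

  # Euclidean distance from every point to each class centroid.
  d1 <- sqrt((x - mean(x[in1]))^2 + (y - mean(y[in1]))^2)
  d2 <- sqrt((x - mean(x[!in1]))^2 + (y - mean(y[!in1]))^2)
  d_own   <- ifelse(in1, d1, d2)
  d_other <- ifelse(in1, d2, d1)

  # Ties count as half correct (an assumption).
  correct <- mean(d_own < d_other) + 0.5 * mean(d_own == d_other)

  # Assumed rescaling: 0.5 correct (chance) maps to overlap 1,
  # 1.0 correct maps to overlap 0.
  c(correct = correct, overlap = min(1, 2 * (1 - correct)))
}

# Two simulated unit-variance clouds with means 2 SDs apart.
set.seed(2)
n <- 500
x <- c(rnorm(n, 0), rnorm(n, 2))
y <- rnorm(2 * n)
res <- cccc(x, y, rep(c("A", "B"), each = n))
print(res)
```

Only two means and one distance comparison per point are needed, which is the sense in which the calculation is simple compared with estimating and integrating two densities.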

While results from the ANAE Explorer were preliminary, it appeared that the Pillai metric failed to reflect the degree of low back separation of some of the speakers in the Mid-Atlantic and Inland North regions. Another point to mention is that it makes a big difference whether the overlap of the LOT and THOUGHT vowels is assessed with or without making an adjustment for phonetic environment.

Experimenting with this adjustment -- which amounts to working with the residuals from a regression model that fits preceding- and following-segment coefficients pooled across all speakers -- shows that LOT and THOUGHT usually appear to overlap more once phonetic effects are taken into account. An extreme example of this is Gus K. from Nashville, TN, whose unadjusted BA was .320, but whose adjusted BA was .818. However, factoring out phonetic environment can sometimes have the opposite effect, as it did for Tony M. from Knoxville, TN, whose unadjusted BA was .690 and whose adjusted BA was .234.
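The general shape of such an adjustment can be sketched as follows. Everything specific here is hypothetical: the data are simulated, the column names (speaker, vowel, preceding, following) and the segment categories are made up, and the model contains only the two environment terms described above. Whatever other terms the actual model included are not reproduced.

```r
# Simulated token table standing in for real formant data (hypothetical).
set.seed(42)
n <- 300
tokens <- data.frame(
  speaker   = sample(c("s1", "s2", "s3"), n, replace = TRUE),
  vowel     = sample(c("LOT", "THOUGHT"), n, replace = TRUE),
  preceding = sample(c("labial", "coronal", "velar"), n, replace = TRUE),
  following = sample(c("stop", "fricative", "nasal"), n, replace = TRUE)
)
# Built-in environment effects, so there is something to factor out.
tokens$F1 <- 700 + 30 * (tokens$following == "nasal") + rnorm(n, sd = 40)
tokens$F2 <- 1200 - 80 * (tokens$preceding == "labial") + rnorm(n, sd = 60)

# One model for all speakers pooled: F1 and F2 as a function of the
# preceding and following segment.
env_model <- lm(cbind(F1, F2) ~ preceding + following, data = tokens)

# The residuals are the environment-adjusted formant values.
adj <- residuals(env_model)
tokens$F1_adj <- adj[, "F1"]
tokens$F2_adj <- adj[, "F2"]

# After adjustment, mean F2 no longer differs by preceding segment.
print(round(tapply(tokens$F2_adj, tokens$preceding, mean), 6))
```

An overlap measure would then be computed per speaker on F2_adj and F1_adj instead of the raw F2 and F1.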
