Thursday, 11 May 2017

On Exactitude In Science: A Miniature Study On The Effects Of Typical And Current Context

In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province. In time, even these immense Maps no longer satisfied, and the Cartographers' Guilds surveyed a Map of the Empire whose Size was that of the Empire, and which coincided point for point with it. The following Generations, less addicted to the Study of Cartography, realized that that vast Map was useless, and not without some Pitilessness delivered it up to the Inclemencies of Sun and Winter. In the Deserts of the West, there remain tattered Ruins of that Map, inhabited by Animals and Beggars; in all the Land there is no other Relic of the Disciplines of Geography.  (J. L. Borges)

A scientific model makes predictions based on a number of variables and parameters. The more complex the model, the more accurate its predictions. But all things being equal, a simpler model is preferred. As Newton put it: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances."

Exemplar Theory makes predictions about the present or future based on an enormous amount of stored information about the past. For example, a speaker is said to pronounce a word by taking into account the thousands of ways he or she has produced and heard it before. If such feats of memory are possible - I ain't sayin' I don't believe Goldinger 1998 - we should not be surprised by the accuracy of models that rely on them. And if language can be shown to rely on them, so be it. But the abandonment of parsimony, in the absence of clear evidence, should be resisted. (See the comment by S. Brown below for an alternative view of this issue.)

The same phenomena can often be accounted for "equally well" by a more deductive traditional theory or by a more inductive, bottom-up approach. The chemical elements were assigned to groups (alkali metals, halogens, etc.) because of their similar physical properties and bonding behavior long before the nuclear basis for these similarities was discovered. In biology, the various taxonomic branchings of species can be thought of as a reflection and continuation of their historical evolution, but the differences exist on a synchronic level as well - in organisms' DNA.

In classical mechanics, if an object begins at rest, its current velocity can be determined by integrating its acceleration over time: v(t) = ∫ a(t) dt. By storing the acceleration at every point along the trajectory, we can always recover the present velocity. If we ride a bus with our eyes closed and our fingers in our ears, we can estimate our present speed, as long as we remember everything about our past accelerations.

A free-falling object, under the constant acceleration of gravity, g, has velocity v(t) = g · t. But a block sliding down a ramp made up of various curved and angled sections, like a roller coaster, has an acceleration that changes with time. The acceleration at any moment is proportional to the sine of the angle of the ramp: a(t) = g · sin(θ(t)). Integrating, v(t) = g · ∫ sin(θ(t)) dt.

On a simple inclined plane, the angle is constant, so the acceleration is too. The velocity increases linearly: like free-fall, only slower. If the shape of the ramp is complicated, solving the integral of acceleration can be very difficult. (It might be beyond the capacity of the brain to calculate - but on a real roller coaster, we don't have to remember the past ten seconds to know how fast we are going now! We use other cues to accomplish that.)

But forgetting integration, we can solve for velocity in another way, showing that it depends only on the vertical height fallen: v = √(2 · g · h). Obviously this is simpler than keeping track of complex and changing accelerations over time. This equation, rewritten as ½ · v² = g · h, also reflects the balance between kinetic and potential energy, one part of the important physical law of conservation of energy. Instead of a near-truism with a messy implementation, we have an elegant and powerful principle.
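
To make the contrast concrete, here is a small numerical sketch (the ramp profile is a hypothetical one of my own, not anything from the text): integrating every instant of acceleration step by step recovers the same velocity that the energy shortcut v = √(2 · g · h) gives in one line.

```python
import math

g = 9.81  # m/s^2

def slope_angle(s):
    """Hypothetical ramp profile: angle of descent (radians) as a
    function of distance travelled along the track. Always downhill."""
    return 0.6 + 0.3 * math.sin(s)

def simulate(total_time=2.0, dt=1e-5):
    """Integrate dv/dt = g*sin(theta(s)) and ds/dt = v step by step,
    while also accumulating the vertical height fallen, h."""
    v = s = h = 0.0
    for _ in range(int(total_time / dt)):
        theta = slope_angle(s)
        v += g * math.sin(theta) * dt   # "remembering" every acceleration
        ds = v * dt
        s += ds
        h += ds * math.sin(theta)       # vertical component of this step
    return v, h

v_integrated, h = simulate()
v_shortcut = math.sqrt(2 * g * h)       # no history needed, only the drop
print(v_integrated, v_shortcut)
```

The two numbers agree closely, because conservation of energy holds regardless of the ramp's shape; the shortcut just skips the bookkeeping.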

Both expressions for velocity fit the same data, but to call the second an "emergent generalization", à la Bybee 2001, ignores its universality and demotivates the essential next step: the search for deeper explanations.

Admittedly, this physical allegory is unlikely to convince any exemplar theorists to recant. But we should realize that, given the theory's power and its lack of explanatory depth, any correct predictions made by ET do not constitute real evidence in its favor. We need to determine whether the data also support an alternative theory, or at least find places where the data is more compatible with a weaker version of ET than with a stronger one.

A recently-discussed case suggests that with respect to phonological processes, individual words are not only influenced by their current contexts, but also, to a lesser degree, by their typical contexts (Guy, Hay & Walker 2008; Hay 2013). This is one of several recent studies by Hay and colleagues that show widespread and principled lexical variation, well beyond the idiosyncratic lexical exceptionalism sometimes acknowledged in the past, e.g. "when substantial lexical differences appear in variationist studies, they appear to be confined to a few lexical items" (Guy 2009).

The strong-ET interpretation is that all previous pronunciations of a word are stored in memory, and this gives us the typical-context distribution for each word. But if this is the case, the current-context effect must derive from something else: either from mechanical considerations or from analogy to other words. It can't also reflect sampling from sub-clouds of past exemplars, because that would cancel out the typical-context effect.

For words to be stored along with their environments is actually a weak version of word-specific phonetics (Pierrehumbert 2002). It is not that words are explicitly marked to behave differently; they only do so because of the environments they typically find themselves in. For Yang (2013: 6325), "these use asymmetries are unlikely to be linguistic but only mirror life." But whether they mirror life or reflect language-internal collocational structures, these asymmetries are not properties of individual words.

Under this model of word production - sampling from the population of stored tokens, then applying a constant multiplicative contextual effect - we observe the following pattern (in this case, the process is t/d-deletion, as in Hay 2013; the parameters are roughly based on real data):

[Figure] Exemplar Model: Contextual Effect Greatest When Pool Is Least Reduced

This pattern has two main features: as words' typical contexts favor retention more, retention rates increase linearly both before V and before C, and the gap between the two environments widens. From the Exemplar Theory perspective, when the pool of tokens contains mainly unreduced forms, the differences between constraints on reduction can be seen more clearly. But when many of the tokens in the pool are already reduced, the difference between pre-consonantal and pre-vocalic environments appears smaller. Such reduced constraint sizes are the natural result when a process of "pre-deletion" intersects with a set of rules or constraints that apply later, as discussed in this post, and in a slightly different sense in Guy 2007.
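
The mechanics of that exemplar model can be sketched in a short simulation (the multipliers below are my own illustrative guesses, not the parameters behind the figure): sample a token from the word's pool of stored exemplars, then apply a constant multiplicative retention effect for the current context.

```python
import random

random.seed(1)

M_PRE_V = 0.95  # retention multiplier before a vowel (assumed)
M_PRE_C = 0.60  # retention multiplier before a consonant (assumed)

def produce(pool_retention_rate, multiplier, n=20000):
    """Simulate n productions: draw an exemplar (retained or reduced)
    from the pool, then keep the t/d only if the current context
    also permits retention."""
    retained = sum(
        1 for _ in range(n)
        if random.random() < pool_retention_rate   # exemplar was unreduced
        and random.random() < multiplier           # context permits retention
    )
    return retained / n

gaps = []
# Words whose typical contexts favor retention to increasing degrees:
for pool in (0.2, 0.4, 0.6, 0.8):
    v = produce(pool, M_PRE_V)
    c = produce(pool, M_PRE_C)
    gaps.append(v - c)
    print(f"pool={pool:.1f}  _V={v:.2f}  _C={c:.2f}  gap={v - c:.2f}")
```

Because the contextual effect is multiplicative on the retention rate, the _V/_C gap is proportional to the pool's retention rate, so it widens steadily as typical context favors retention more.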

An alternative to storing every token is to say that words acquire biases from their contexts, and that these biases become properties of the words themselves. The source of a bias could be irrelevant to its representation - one word typically heard before consonants, another typically heard from African-American speakers, and another typically heard in fast speech contexts could all be marked for "extra deletion" in the same way.

From the point of view of parsimony, this is appealing. To figure out how a speaker might pronounce a word, the grammar would have to refer to a medium-sized list of by-word intercepts, but not search through a linguistic biography thick and complex enough to have been written by Robert Caro.

But the theoretical rubber needs to hit the empirical road, or else we are just spinning our wheels here. So, compared to the Stor-It-All model, does a stripped-down word-intercept approach make adequate predictions, or - dare we hope - even better ones? Are the predictions even that different?

If we assume that for binary variables, by-word intercepts (like by-speaker intercepts) combine with contextual effects additively on the log-odds scale (which seems more or less true), we obtain a pattern like this:

[Figure] Intercept Model: Typical Context Combines with Constant Current Context

Although the two figures are not wildly different, we can see that in this case, there is no steady separation of the _V and _C effects as overall retention increases. The following-segment effect is constant in log-odds (by stipulation), and this manifests as a slight widening near the center of the distribution. The effects of current context and typical context are independent in this model, as opposed to what we saw above.
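
The additive picture can be sketched in a few lines (the intercept and context values below are illustrative, not fitted to anything): each word carries an intercept in log-odds, the following-segment effect is a constant log-odds shift, and the two are pushed through the logistic function.

```python
import math

def logistic(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

CONTEXT_V = 1.0   # boost to retention before vowels (log-odds, assumed)
CONTEXT_C = -1.0  # penalty before consonants (log-odds, assumed)

gaps = []
for word_intercept in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p_v = logistic(word_intercept + CONTEXT_V)
    p_c = logistic(word_intercept + CONTEXT_C)
    gaps.append(p_v - p_c)
    print(f"intercept={word_intercept:+.1f}  "
          f"_V={p_v:.2f}  _C={p_c:.2f}  gap={p_v - p_c:.2f}")
```

A constant log-odds effect translates into a probability gap that is widest in the middle of the distribution and shrinks symmetrically toward both extremes, rather than growing without limit as retention increases.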

As usual, the Buckeye Corpus (Pitt et al. 2007) is a good proving ground for competing predictions of this kind. The Philadelphia Neighborhood Corpus has a similar amount of t/d coded (with more coming soon). Starting with Buckeye, I only included tokens of word-final t/d that were followed by a vowel or a consonant. I excluded all tokens with preceding /n/, in keeping with the sociolinguistic lore, "Beware the nasal flap!" I then restricted the analysis to words with at least 10 total tokens each - and excluded the word "just", because it had about eight times as many tokens as any of the other words. I was left with 2418 tokens of 69 word types.
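
The exclusion steps can be sketched roughly as follows, over a hypothetical token list of (word, preceding segment, following segment) tuples; the real Buckeye extraction is more involved, and the field layout here is invented.

```python
from collections import Counter

def filter_tokens(tokens, min_count=10):
    """Keep word-final t/d tokens followed by a vowel or consonant,
    drop tokens with preceding /n/, drop 'just', then keep only words
    with at least min_count surviving tokens."""
    filtered = [(w, pre, fol) for (w, pre, fol) in tokens
                if fol in ("V", "C")   # exclude pauses and the like
                and pre != "n"         # "Beware the nasal flap!"
                and w != "just"]       # outsized token count
    counts = Counter(w for (w, _, _) in filtered)
    return [t for t in filtered if counts[t[0]] >= min_count]

# Tiny invented demo: only "mist" survives all the filters.
demo = ([("mist", "s", "V")] * 12 + [("mist", "n", "V")] * 3
        + [("rare", "a", "C")] * 4 + [("just", "s", "C")] * 50)
kept = filter_tokens(demo)
print(len(kept), {w for (w, _, _) in kept})
```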

Incidentally, there is no significant relationship between a word's overall deletion rate and its (log) frequency, whether the frequency measure is taken from the Buckeye Corpus itself (p = .15) or from the 51-million-word corpus of Brysbaert et al. 2013 (p = .37). The absence of a significant frequency effect on what is arguably a lenition process goes against a key tenet of Exemplar Theory (Bybee 2000, Pierrehumbert 2001), but the issue of frequency is not our main concern here.

I first plotted two linear regression lines, one for the pre-vocalic environments and one for the pre-consonantal environments. The regressions were weighted according to the number of tokens for each word. I then tried a quadratic rather than a linear regression, but these curves did not provide a significantly better fit to the data - p(_V) = .57, p(_C) = .46 - so I retreated to the linear models. The straight lines plotted below look parallel; in fact the slope of the _V line is 0.301 and the slope of the _C line is 0.369. Since the lines converge slightly rather than diverging markedly, this data is less consistent with the exemplar model sketched above, and more consistent with the word-intercept model.
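
The model comparison can be sketched on simulated data (the numbers are invented, not the Buckeye tokens): fit weighted linear and quadratic polynomials and compare their weighted residual sums of squares. Since the models are nested, the quadratic fit can only lower the RSS; the question is whether it lowers it by enough to justify the extra term.

```python
import numpy as np

np.random.seed(0)

# x: each word's typical context (proportion of tokens before a vowel);
# n: per-word token counts, used as regression weights. The underlying
# trend is built to be linear, mimicking the situation described above.
x = np.linspace(0.05, 0.95, 40)
n = np.random.randint(10, 61, size=x.size)
true_rate = 0.35 + 0.30 * x
y = np.random.binomial(n, true_rate) / n   # observed retention rates

def weighted_rss(degree):
    """Weighted residual sum of squares for a polynomial fit; np.polyfit's
    w argument takes sqrt-style weights, so w=sqrt(n) minimizes
    sum(n * residual**2)."""
    coefs = np.polyfit(x, y, degree, w=np.sqrt(n))
    resid = y - np.polyval(coefs, x)
    return float(np.sum(n * resid ** 2))

rss_linear, rss_quadratic = weighted_rss(1), weighted_rss(2)
print(rss_linear, rss_quadratic)  # nested: quadratic RSS <= linear RSS
```

On data like this, an F-test on the RSS reduction is the usual way to decide whether the quadratic term earns its keep.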

[Figure] Buckeye Corpus: Parallel Lines Support Intercept Model, Not Exemplars

One way to improve this analysis would be to use a larger corpus, at least for the x-axis, to more accurately estimate the proportion of the time that a given word ending in t/d is followed by a vowel rather than a consonant. For example, the spoken section of COCA (Davies 2008-) is about 250 times larger than the Buckeye Corpus. Of course, for a few words the estimate from the local corpus might better represent those speakers' biases.

Turning finally to data from the Philadelphia Neighborhood Corpus, we see a fairly similar picture. Note that some of the words' left-right positions differ noticeably between the two studies. The word "most", despite having 150-200 tokens, occurs before a vowel 75% of the time in Philadelphia, but only 52% of the time in Ohio. It is hard to think what this could be besides sampling error, but if it is that, it casts some doubt on the reliability of these results, especially as most words have far fewer tokens.

[Figure] Philadelphia Neighborhood Corpus: Convergence, Not Exemplar Prediction

Regarding the regression lines, there are two main differences. First, Philadelphia speakers delete much more before consonants than Ohio speakers, while there is no overall difference before vowels. This creates the greater following-segment effect previously noted for Philadelphia.

The second difference is that in Philadelphia, a word's typical context seems to barely affect its behavior before vowels. The slope before consonants, 0.317, is close to those observed in Ohio, but the slope before vowels is only 0.143 - not significantly different from zero (p = .14). Recall that under the exemplar model, the _V slope should always be larger than the _C slope; words almost always occurring before vowels - passed, walked, talked - should provide a pool of pristine, unreduced exemplars upon which the effects of current context should be most visible.

I have no explanation at present for the opposite trend being found in Philadelphia, but it is clear that neither the PNC data nor the Buckeye Corpus data show the quantitative patterns predicted by the exemplar theory model. This, and a general preference for parsimony - in storage, and arguably in computation (again, see S. Brown below) - points to typical-context effects being "ordinary" lexical effects. "[We] shall know a word by the company it keeps" (Firth 1957: 11), but we still have no reason to believe that the word itself knows all the company it has ever kept. And to find our way forward, we may not need a map at 1:1 scale.

Thanks: Stuart Brown, Kyle Gorman, Betsy Sneller, & Meredith Tamminga.


Borges, Jorge Luis. 1946. Del rigor en la ciencia. Los Anales de Buenos Aires 1(3): 53.

Brysbaert, Marc, Boris New and Emmanuel Keuleers. 2013. SUBTLEX-US frequency list with PoS information final text version. Available online at

Bybee, Joan. 2000. The phonology of the lexicon: evidence from lexical diffusion. In Michael Barlow and Suzanne Kemmer (eds.), Usage-based models of language. Stanford: CSLI. 65-85.

Bybee, Joan. 2001. Phonology and language use. Cambridge Studies in Linguistics 94. Cambridge: Cambridge University Press.

Davies, Mark. 2008-. The Corpus of Contemporary American English: 450 million words, 1990-present. Available online at

Firth, John R. 1957. A synopsis of linguistic theory, 1930-1955. In Studies in Linguistic Analysis, Special volume of the Philological Society. Oxford: Basil Blackwell.

Guy, Gregory. 2007. Lexical exceptions in variable phonology. Penn Working Papers in Linguistics 13(2), Papers from NWAV 35, Columbus.

Guy, Gregory. 2009. GoldVarb: Still the right tool. NWAV 38, Ottawa.

Guy, Gregory, Jennifer Hay and Abby Walker. 2008. Phonological, lexical, and frequency factors in coronal stop deletion in early New Zealand English. LabPhon 11, Wellington.

Hay, Jennifer. 2013. Producing and perceiving "living words". UKLVC 9, Sheffield.

Pierrehumbert, Janet. 2001. Exemplar dynamics: word frequency, lenition and contrast. In Joan Bybee and Paul Hopper (eds.), Frequency and the emergence of linguistic structure. Amsterdam: John Benjamins. 137-157.

Pierrehumbert, Janet. 2002. Word-specific phonetics. Laboratory Phonology 7. Berlin: Mouton de Gruyter. 101-139.

Pitt, Mark A. et al. 2007. Buckeye Corpus of Conversational Speech. Columbus: Department of Psychology, Ohio State University.

Yang, Charles. 2013. Ontogeny and phylogeny of language. Proceedings of the National Academy of Sciences 110(16): 6324-6327.
