Recently I started a doctoral course at Sophia, and so far it’s been really great. I’m doing everything that I want to be doing, the other students in the lab are wonderful people, and Arai-sensei has been good at giving me just the right pushes in the right directions. But I also need to be doing a lot more reading than I have been, so I decided I need to assign myself some homework to make sure that I’m keeping up with my reading. As I read through the various papers related to my field, I’m going to post summaries up here. Maybe they’ll be helpful to other people working in the field, and I hope at least that I can explain them well enough that people outside of my particular specialty domains will understand them.
For my first post in the series, I want to go through Cathy Best’s Perceptual Assimilation Model. She’s written it up in more than a few places, I think, but here I’m summarizing a book chapter that seems to get cited most frequently.
Why do we have such different skills in our second languages, compared to our first languages? In particular, why can we differentiate sounds in our first language, even from across a noisy, crowded bar, but find sounds in foreign languages much harder to pick out? And why are some sounds much easier to learn than others?
The Perceptual Assimilation Model is one attempt at answering those questions. It starts from the point of view that our perceptions are fairly accurate, and not actually hindered by our first language. But because we attune our perceptions to be very sensitive to distinctions in our first language, especially by considering multiple gestures as groupings that go together, we have difficulty hearing sounds that don’t follow those patterns. It predicts that some sound pairs will be difficult for speakers of some languages to differentiate, like /r/-/l/ for Japanese people. Sounds that are familiar, or that fit into separate categories in our native language, are much easier.
PAM extends from an epistemological position called Direct Realism. The core principle is that we directly perceive the reality of our environment, rather than filtering our perceptions through mental constructs. The competing Representationalist views of perception would say that we filter the information from our sensory organs through mental constructs and simplifications, so our vision conceives of the world as edges, angles, lines and contrasts in tone—this is a “snapshot” view of visual perception. Under Direct Realism, we actually do perceive the real world, directly, and any mental constructs come after the act of perception. This doesn’t mean that our perceptions are 100% accurate, mind you, or that it’s impossible for our perceptions to be fooled. Direct perception is not the same thing as perfect perception.
Applied to speech sounds, this means that we are able to perceive the articulatory movements in the vocal tract through sound waves. By hearing the speech of another person, I am able to directly perceive the movements of their tongue, lips, jaw, velum, etc. This is a bit different from the Motor Theory of speech, which posits a sort of innate knowledge of the vocal tract. Direct Realism doesn’t require us to have a working model of the vocal tract pre-programmed into our minds, because we can hear its moving parts through the sound signal.
How the model works
If our perceptions directly reflect reality, then why can’t we perceive non-native sounds very well? That’s really the core function of the Perceptual Assimilation Model. We start out perceiving everything in the speech signal, hearing all of the little perturbations between the vocal folds and the lips. But over the course of the first year of life, we attune our perceptions to become more and more efficient at discriminating sounds of importance to our first language(s). We begin to conceive of sounds in terms of constellations of gestures, rather than individual, unconnected gestures occurring simultaneously. When we hear an English /ɹ/, we don’t just perceive the primary gesture of curling the tongue back, but we also associate it with the secondary gestures of pulling back the body of the tongue, rounding the lips, and lifting the jaw a bit. Perceiving that set of related gestures allows us to process our native language faster and more efficiently, and even perceive elements of the gesture constellation when background noise makes a few of the gestures harder to hear.
But the efficiency comes at a cost. Once we’ve developed these finely attuned perceptions of gesture combinations, we have difficulty perceiving speech sounds that don’t use those gestures in those combinations. English speakers expect to hear a puff of air after a /p/ sound at the beginning of a word, and when we don’t hear that little puff in French, it throws us off. We perceive the unaspirated French /p/ more slowly, or we mistakenly classify it as a /b/.
What it predicts
The Perceptual Assimilation Model predicts that we will perceive non-native sounds with better or worse accuracy depending on how closely the sound maps onto existing categories in our own native sound system. Sounds that share the same constellations of gestures are quite common across languages, so it’s no mystery that Japanese listeners can hear an English /m/ sound just as well as English native speakers can. There are also sounds that are totally foreign, sharing almost no gestures at all with native speech sounds. English speakers are relatively good at perceiving the difference between a /t/ sound and a click sound, like those present in the Xhosa language of South Africa—although they may have a tougher time discriminating between the 18 different clicks in Xhosa. A non-native sound is either ① assimilated to an existing, native category, ② taken as a non-native speech sound, or ③ heard as a non-speech sound (like choking, tapping, or snapping fingers). Combinations of these categories tell us how easy or difficult it should be for a non-native listener to perceive the difference between any two sounds in a foreign language:
- Two-Category Assimilation (TC)—the listener hears both sounds as different sounds in their native language, like /m/ versus /n/. These differences are quite easy to hear, since the task is basically the same as differentiating two sounds that the listener knows in their native language.
- Category-Goodness Difference (CG)—both sounds fit into the same category in the native sound system, but one is a better fit than the other. English uses a [k] sound with a bit of aspiration, but doesn’t use an ejective. An English speaker hearing a language that contrasts the two should be moderately good at telling them apart, because they’re clearly different, even though both may sound like a single sound in the L1.
- Single-Category Assimilation (SC)—both sounds could be heard as the same phoneme, but neither is really a good fit, and neither is particularly ‘better’ than the other. Japanese listeners usually hear English /ɹ/ and /l/ as strange or poor examples of /ɾ/, and are therefore fairly bad at discriminating the two sounds.
- Both Uncategorizable (UU)—the two sounds are very clearly foreign and don’t belong to a native language category. Xhosa clicks, for example, sound nothing like the consonants that we find in English. Sound distinctions in this category can range from really good to awful, depending on the actual sounds in question. I find it pretty easy to hear the difference between Xhosa’s /ǁ/ and /!/ sounds, but I absolutely cannot hear the difference between the plain [kǁ] and the aspirated [kǁʰ] sounds.
- Nonassimilable (NA)—both sounds are heard as non-speech sounds. If you had never heard Xhosa before, you might hear the sounds above as tapping or popping sounds, in which case you’d probably be pretty good at discriminating them.
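Just as a toy illustration of the taxonomy above (this is my own sketch, not any formalism from Best’s chapter), you can think of each non-native sound as being tagged with an assimilation result, and the pair type falling out of how two tags combine. All the sound names and goodness numbers here are hypothetical placeholders:

```python
# Toy sketch of PAM's pairwise predictions. Categories, goodness scores,
# and the 0.3 "clearly better fit" threshold are illustrative inventions.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Assimilation:
    """How a listener assimilates one non-native sound."""
    native_category: Optional[str]    # e.g. "/t/"; None if not heard as any native phoneme
    goodness: Optional[float] = None  # fit to that category, 0 (poor) to 1 (good)
    is_speech: bool = True            # False -> heard as a non-speech sound


def classify_pair(a: Assimilation, b: Assimilation) -> str:
    """Return the PAM pair type for two non-native sounds."""
    if not a.is_speech and not b.is_speech:
        return "NA"  # Nonassimilable: both heard as non-speech sounds
    if a.native_category is None and b.native_category is None:
        return "UU"  # Both Uncategorizable
    if a.native_category is None or b.native_category is None:
        raise ValueError("mixed categorized/uncategorized pairs are outside this sketch")
    if a.native_category != b.native_category:
        return "TC"  # Two-Category: mapped to different native phonemes
    # Same native category: CG if one is a clearly better exemplar, else SC
    if a.goodness is not None and b.goodness is not None and abs(a.goodness - b.goodness) > 0.3:
        return "CG"
    return "SC"


# Hypothetical example: a Japanese listener hearing English /ɹ/ and /l/,
# both assimilated as equally poor exemplars of Japanese /ɾ/.
r = Assimilation(native_category="/ɾ/", goodness=0.3)
l = Assimilation(native_category="/ɾ/", goodness=0.35)
print(classify_pair(r, l))  # SC -> discrimination predicted to be poor
```

The point of the sketch is just that the five pair types aren’t five independent facts to memorize: they’re the possible combinations of three assimilation outcomes, plus a goodness comparison when both sounds land in the same native category.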
PAM has been backed up over the years by an awful lot of studies into the phonological systems of second language speakers. The categories of distinctions that PAM predicts really do seem to match the perceptions of real live listeners. I’m not entirely convinced of the Direct Realist view of perceptions, but leaving the philosophical underpinnings aside, PAM has been really good at providing a framework for understanding second language phonology.
Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171–204). Timonium, MD: York Press.
Next up, I’m going to look at another of the big theories in second language phonology: Jim Flege’s Speech Learning Model.