The Perceptual Assimilation Model

Recently I started a doctoral course at Sophia, and so far it’s been really great. I’m doing everything that I want to be doing, the other students in the lab are wonderful people, and Arai-sensei has been good at giving me just the right pushes in the right directions. But I also need to be doing a lot more reading than I have been, so I decided I need to assign myself some homework to make sure that I’m keeping up with my reading. As I read through the various papers related to my field, I’m going to post summaries up here. Maybe they’ll be helpful to other people working in the field, and I hope at least that I can explain them well enough that people outside of my particular specialty domains will understand them.

For my first post in the series, I want to go through Cathy Best’s Perceptual Assimilation Model. She’s written it up in more than a few places, I think, but here I’m summarizing a book chapter that seems to get cited most frequently.

Why do we have such different skills in our second languages, compared to our first languages? In particular, why can we differentiate sounds in our first language, even from across a noisy, crowded bar, but find sounds in foreign languages much harder to pick out? And why are some sounds much easier to learn than others?

The Perceptual Assimilation Model is one attempt at answering those questions. It starts from the point of view that our perceptions are fairly accurate, and not actually hindered by our first language. But because we attune our perceptions to be very sensitive to distinctions in our first language, especially by considering multiple gestures as groupings that go together, we have difficulty hearing sounds that don’t follow those patterns. It predicts that some sound pairs will be difficult for speakers of some languages to differentiate, like /r/-/l/ for Japanese people. Sounds that are familiar, or that fit into separate categories in our native language, are much easier.

Direct Realism

PAM extends from an epistemological position called Direct Realism. The core principle is that we directly perceive the reality of our environment, rather than filtering our perceptions through mental constructs. The competing Representationalist views of perception would say that we filter the information from our sensory organs through mental constructs and simplifications, so our vision conceives of the world as edges, angles, lines and contrasts in tone—this is a “snapshot” view of visual perception. Under Direct Realism, we actually do perceive the real world, directly, and any mental constructs come after the act of perception. This doesn’t mean that our perceptions are 100% accurate, mind you, or that it’s impossible for our perceptions to be fooled. Direct perception is not the same thing as perfect perception.

Applied to speech sounds, this means that we are able to perceive the articulatory movements in the vocal tract through sound waves. By hearing the speech of another person, I am able to directly perceive the movements of their tongue, lips, jaw, velum, etc. This is a bit different from the Motor Theory of speech, which posits a sort of innate knowledge of the vocal tract. Direct Realism doesn’t require us to have a working model of the vocal tract pre-programmed into our minds, because we can hear the movements of the moving parts through the sound signal.

How the model works

If our perceptions directly reflect reality, then why can’t we perceive non-native sounds very well? That’s really the core function of the Perceptual Assimilation Model. We start out perceiving everything in the speech signal, hearing all of the little perturbations between the vocal folds and the lips. But over the course of the first year of life, we attune our perceptions to become more and more efficient at discriminating sounds of importance to our first language(s). We begin to conceive of sounds in terms of constellations of gestures, rather than individual, unconnected gestures occurring simultaneously. When we hear an English /ɹ/, we don’t just perceive the primary gesture of curling the tongue back, but we also associate it with the secondary gestures of pulling back the body of the tongue, rounding the lips, and lifting the jaw a bit. Perceiving that set of related gestures allows us to process our native language faster and more efficiently, and even perceive elements of the gesture constellation when background noise makes a few of the gestures harder to hear.

But the efficiency comes at a cost. Once we’ve developed these finely attuned perceptions of gesture combinations, we have difficulty perceiving speech sounds that don’t use those gestures in those combinations. English speakers expect to hear a puff of air after a /p/ sound at the beginning of a word, and when we don’t hear that little puff in French, it throws us off. We perceive the less aspirated French /p/ more slowly, or we mistakenly classify it as a /b/.

What it predicts

The Perceptual Assimilation Model predicts that we will perceive non-native sounds with better or worse accuracy depending on how closely the sound maps to existing categories in our own native sound system. Sounds that share the same constellations of gestures are quite common across languages, so it’s no mystery that Japanese listeners can hear an English /m/ sound just as well as English native speakers can. There are also those sounds that are totally foreign, sharing almost no gestures at all with native speech sounds. English speakers are relatively good at perceiving the difference between a /t/ sound and a click sound, like those present in the Xhosa language of South Africa—although they may have a tougher time discriminating between the 18 different clicks in Xhosa. A non-native sound is either ① assimilated to an existing, native category, ② taken as a non-native speech sound, or ③ heard as a non-speech sound (like choking, tapping, or snapping fingers). Combinations of these categories tell us how easy or difficult it should be for a non-native listener to perceive the difference between any two sounds in a foreign language:

  • Two-Category Assimilation (TC)—the listener hears both sounds as different sounds in their native language, like /m/ versus /n/. These differences are quite easy to hear, since the task is basically the same as differentiating two sounds that the listener knows in their native language.
  • Category-Goodness Difference (CG)—both sounds fit into the same category in the native sound system, but one is a better fit than the other. English uses a [k] sound with a bit of aspiration in it, but doesn’t use an ejective. An English speaker hearing a language that distinguishes the two sounds should be moderately good at distinguishing these sounds, because they’re clearly different, even though they may sound like a single sound in the L1.
  • Single-Category Assimilation (SC)—both sounds could be heard as the same phoneme, but neither is really a good fit, and neither is particularly ‘better’ than the other. Japanese listeners usually hear English /ɹ/ and /l/ as strange or poor examples of /ɾ/, and therefore they are fairly bad at discriminating the two sounds.
  • Both Uncategorizable (UU)—the two sounds are very clearly foreign and don’t belong to a native language category. Xhosa clicks, for example, sound nothing like the consonants that we find in English. Sound distinctions in this category can range from really good to awful, depending on the actual sounds in question. I find it pretty easy to hear the difference between Xhosa’s /ǁ/ and /!/ sounds, but I absolutely cannot hear the difference between the plain [kǁ] and the aspirated [kǁʰ] sounds.
  • Nonassimilable (NA)—both sounds are heard as non-speech sounds. If you had never heard of Xhosa before, you might hear the sounds above as tapping or popping sounds, in which case you’d probably be pretty good at discriminating them.
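The contrast types above amount to a small decision procedure, and it can be sketched in code. This is purely my own toy illustration, not anything from Best’s chapter: the category labels, the goodness scores, and the 0.2 “better fit” margin are all made-up assumptions.

```python
# Toy sketch of PAM's pairwise contrast types. Each non-native sound is
# reduced to a (native_category, goodness_of_fit) pair: the native phoneme
# it assimilates to (or NON_SPEECH, or None for "speech, but uncategorizable")
# plus a 0.0-1.0 fit score. All labels and thresholds here are illustrative.

NON_SPEECH = "non-speech"

def pam_contrast(sound_a, sound_b, goodness_margin=0.2):
    """Return the PAM contrast type for a pair of non-native sounds."""
    (cat_a, fit_a), (cat_b, fit_b) = sound_a, sound_b

    if cat_a == NON_SPEECH and cat_b == NON_SPEECH:
        return "NA"  # Nonassimilable: both heard as non-speech sounds
    if cat_a is None and cat_b is None:
        return "UU"  # Both Uncategorizable: discrimination varies widely
    if (cat_a is None) != (cat_b is None):
        return "UC"  # Uncategorized vs. Categorized (a PAM type not in the list above)
    if cat_a != cat_b:
        return "TC"  # Two-Category: maps onto a native contrast, so it's easy
    if abs(fit_a - fit_b) >= goodness_margin:
        return "CG"  # Category-Goodness: same category, but one fits clearly better
    return "SC"      # Single-Category: same category, equally poor fits

# The Japanese /ɹ/-/l/ example: both heard as poor fits to /ɾ/
print(pam_contrast(("ɾ", 0.40), ("ɾ", 0.45)))  # SC
```

The payoff of writing it this way is that the predictions fall out of just two questions—which native category does each sound land in, and how well does it fit—which is really all the model asks.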

PAM has been backed up over the years by an awful lot of studies into the phonological systems of second language speakers. The categories of distinctions that PAM predicts really do seem to match the perceptions of real live listeners. I’m not entirely convinced of the Direct Realist view of perceptions, but leaving the philosophical underpinnings aside, PAM has been really good at providing a framework for understanding second language phonology.

Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171–204). Timonium, MD: York Press.

Next up, I’m going to look at another of the big theories in second language phonology: Jim Flege’s Speech Learning Model.


English teacher, student of Japanese, and aspiring linguist.

Posted in Paper Summaries, Phonetics and Phonology
5 comments on “The Perceptual Assimilation Model”
  1. locksleyu says:

    Long time no see. Interesting topic to post about. I haven’t studied these things in detail, but I feel that the Representationalist view is closer to what actually happens.

    I admit it’s my technical background that has influenced me, but I feel that essentially the conscious part of our brains receives ‘input’ that comes from the actual organs (the ‘hardware’) and then is processed by other areas of the brain (‘modules’) that make it easier to understand. I don’t know if there is any hierarchical system for this, but it works pretty well for digital computing and it makes a lot of sense to me.

    Though I am not into drugs at all, the fact that LSD can make one see things that don’t exist seems to me to provide a strong counter argument against Direct Realism.

    Even if the multi-tiered model is close to correct, that doesn’t mean we can’t adjust the lower ‘modules’, it just means we need to know tricks on how to do that.

    • gengojeff says:

      Thanks for the comment! I think we approach the topic from a similar background—digital computing and Information Theory—so I pretty well agree with you.

      I can’t pretend to understand the philosophy behind it completely, but I’ll attempt it. Best writes that “…neither does direct realism imply infallible or obligatory perceptual awareness of all properties of material objects that are in the presence of the observer. Fallibility and, indeed, awareness itself are characteristics not of the objects we perceive, but rather of the acts by which we perceive them. Such acts refer to real actions of the perceiver; for example, exploratory movements of the eyes and hands. They do not refer to inferences and indirect mental processes. Given these characterizations, the mere occurrence of hallucinations, mirages, illusions, misperceptions, and individual variations in perception cannot refute the current conceptualization of direct realism.”

      It seems to me that individual differences in perception and the existence of illusions certainly challenge direct realism, if they don’t refute it. Representationalist models at least try to account for those things, but it seems to me that direct realism just says, “the perceptions are real even if they’re inaccurate”, which doesn’t actually solve the problem. Why are they inaccurate? Are those inaccuracies down to flaws in the perceptual organs and the brain functions that decode them? If so, I don’t see that it makes much of a difference to say that the problem lies in the hardware rather than the software, so to speak. It’s still a flawed system, and I don’t think you can build a theory of speech perception that on the one hand says our perceptions are accurate enough to distinguish perturbations and constrictions in the vocal tract well enough to reproduce them, and on the other hand says that they’re inaccurate enough that monolingual English speakers can study their whole lives and never learn to hear Korean’s aspirated/tense consonant distinctions.

      I personally think it’s more likely that we start life with very accurate perceptions for speech, and gradually build up a set of ‘listening algorithms’ that tell us what to pay attention to and what to ignore. Like a compression algorithm, if you like. The actual perceptual organs do get less sensitive with age—in the case of hearing, we lose higher frequencies over the years—but we’re still physically capable of hearing important distinctions for most of our lives. Phonology is the software we run to compress the signal and make speech perception faster, but we can learn to listen to the speech signal as-is and build a new phonology if we want to—we can unpack the MP3 and listen to the WAV file directly, to continue the analogy. Certainly a small army of phoneticians has learned to differentiate all the sounds on the IPA chart without a whole lot of difficulty, and last I checked they weren’t handing out Super Soldier Serums during Intro to Linguistics lectures.

      I may be misinterpreting the theory a bit, though. If direct realism allows for that kind of an ‘abstraction layer’ over the existing perceptions, then I guess we see eye to eye, but I think the ‘direct’ part disallows that.

  2. António Macedo says:

    Hi, gengojeff. I really like your simple down-to-earth explanation of this very complex topic, but I’d just like to inform you that although your explanations for Category-Goodness Difference (CG) and Single-Category Assimilation (SC) are well described, they are backwards. Cheers.
