Studies on speech perception have tended to focus on sound as the only cue. In speech sciences, we usually talk about phonemes in terms of sounds, and we think of the ‘speech signal’ as being mainly acoustic. But sound isn’t the only cue that listeners get! There’s a whole wide world of visual information available: movements of the mouth, gestures, eye contact, and so on. This study examines the effect of video on the perception of sounds from another language, comparing the perceptions of listeners with and without footage of the speaker’s mouth moving. This is almost exactly what I tried to do in my MA thesis, although I was looking more at training effects.
Navarra and Soto-Faraco have a good amount of theoretical backing for the idea that mouth movements ought to be helpful, and their experimental design is really clean and understandable. The paper doesn’t seem to have a ton of citations, but for my research, it’s about as close to an ideal source as I could get!
It’s pretty reasonable to think that mouth movements should be helpful in perceiving speech sounds, but why should that be, exactly? One of the very popular models of speech perception, the “motor theory”, argues that listeners translate the sounds they hear into movements of the tongue, lips and jaw, and the awareness of those movements allows them to reproduce the sounds they hear. Another, more recent model, the Fuzzy Logical Model of Perception, says that speech perception is really more about combining a few different signals, visual and auditory, to make up a stream of speech. But the idea hasn’t been much applied to second languages.
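To make the “combining signals” idea a bit more concrete, here’s a minimal sketch of the two-alternative integration rule usually associated with the Fuzzy Logical Model of Perception: each modality gives a degree of support (between 0 and 1) for one category, the supports are multiplied, and the result is normalised against the support for the other category. The function name and the example numbers are my own illustration, not something taken from Navarra and Soto-Faraco’s paper or their data.

```python
def flmp_two_alternatives(auditory_support: float, visual_support: float) -> float:
    """Probability of choosing category A over category B.

    Each argument is a hypothetical degree of support (0 to 1) that one
    modality gives to category A; support for B is taken as 1 minus that.
    """
    for support in (auditory_support, visual_support):
        if not 0.0 <= support <= 1.0:
            raise ValueError("support values must lie between 0 and 1")

    # Multiply the evidence from the two modalities for each category.
    support_a = auditory_support * visual_support
    support_b = (1.0 - auditory_support) * (1.0 - visual_support)

    total = support_a + support_b
    if total == 0.0:
        # One modality fully supports A while the other fully supports B.
        raise ValueError("the two modalities are completely contradictory")

    # Normalise so that the two category probabilities sum to 1.
    return support_a / total


# Made-up numbers: ambiguous audio (0.5) plus fairly clear video (0.9).
print(flmp_two_alternatives(0.5, 0.9))
```

With completely ambiguous audio, the decision in this sketch is driven by whatever the visual signal supports, which is roughly the kind of situation a listener is in when two sounds fall into one native category.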
In this experiment, Navarra and Soto-Faraco gathered up Spanish/Catalan bilinguals, and had them listen to sounds that are separate in Catalan but the same in Spanish. Spanish has just five vowels, and only one mid-front vowel, /e/. Catalan differentiates /e/ and /ɛ/, with /e/ a little higher up than /ɛ/. Spanish-dominant bilinguals heard them all as a Spanish /e/ sound without video, but with the addition of video, they became much more capable of distinguishing the two Catalan vowel sounds.
What does that mean for second language phonology? Well, it challenges the idea that learners can’t tell similar sounds apart. Under Best’s Perceptual Assimilation Model, this distinction ought to be one of the more difficult ones to make, since learners are asked to divide one category in their native language into two separate categories. Video gives them the ability to divide them, so the addition of that extra signal must make the difference a lot more salient. As Flege’s Speech Learning Model points out, learners need to be able to differentiate at least some of the acoustic differences to start making new categories, but perhaps in this case video is able to support learners when they can’t pick up the acoustic qualities that differentiate two sounds.
As for my own research, I’m planning on looking at something that the authors bring up in the very last line of the paper:
“In practical terms, however, what remains an important implication of our results is that they demonstrate the critical contribution that visual speech gestures can have in accurately perceiving, and possibly acquiring, second languages.”
Navarra and Soto-Faraco have demonstrated that video helps learners perceive new sounds, at least between Spanish and Catalan. What remains to be seen is whether a training program that uses video can help learners acquire the sounds more rapidly, accurately, or easily than audio-only training. I’m working on the training app now, so wish me luck!
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431–461.
Massaro, D. W. (1987). Speech perception by ear and eye. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lip-reading (pp. 53–83). London, England: Lawrence Erlbaum Associates, Inc.