The Ear Pinna and Realism in Music Reproduction
Commentary
Ralph Glasgal
April 1999

Those fluted, rather grotesque protuberances that extend out from each ear canal are called pinnae. The importance of satisfying one's pinnae by reproducing sound fields that complement their complex nature cannot be exaggerated. Demonstrations in the early fifties of live-versus-recorded sound were spectacularly successful because all the sound reaching the audience in a real concert hall came from an appropriate direction, including the ambience-free direct sound from the stage loudspeakers and, of course, all the ambient sound produced by the hall itself. Indeed, G.A. Briggs, of Wharfedale fame, showed that in a hall as large as Carnegie Hall even stereo was not essential to provide pinna pleasure. Like fingerprints, no two individuals' pinnae are exactly alike. Even as late as the mid-20th century the pinnae were thought to be vestigial, but the intricacy of these structures suggests that their function is not only very important to the hearing mechanism but also complex, personal, and sensitive in nature. For audiophiles in search of more realistic sound reproduction, an understanding of how the pinna, head, and torso interact with stereophonic or surround-sound fields is important, since at present a major mismatch exists. Repairing the discrepancy between what present playback methods deliver and what human ear pinnae expect and require is the last major psychoacoustic barrier to be overcome, both in hi-fi music playback and in the burgeoning PC multimedia field. All such applications are covered by an audio engineering discipline known as Auralization.

Auralization Theory and Its Ambiophonic Subset

Auralization is the process of generating or regenerating an imaginary or an existing acoustic sound field of an audible source in a defined space, by mathematical modeling or direct recording, and then making this field audible in such a way as to duplicate the binaural listening experience a listener would have had at a specific location in that original space. As live-music enthusiasts rather than seekers after virtual computer reality, we are primarily concerned with only part of the general auralization problem, namely the recreation of horizontally staged acoustic (usually musical) events recorded in enclosed spaces such as concert halls, opera houses, pop venues, etc., where the listening position is centered, fixed, and usually close to the stage. I have called this two-channel subset of the broader auralization problem Ambiophonics, because it is both related to and a suitable replacement for stereophonics. Another way of stating a major goal of Ambiophonics, and of describing a still-unsolved problem of virtual reality or surround auralization, is the externalization of the binaural earphone effect. In brief, this means duplicating the full, everyday binaural hearing experience, either via earphones, without having the sound field appear to be within one's head, or via loudspeakers, without losing either its directional clarity or the "cocktail party" effect whereby one can focus on a particular conversation despite noise or other voices. So far this goal has eluded researchers trying to externalize the binaural effect over a full sphere or circle, but it can be done using Ambiophonic methods for the front half of the horizontal plane.

Externalizing the Binaural Effect

It is intuitively obvious, as mathematicians are fond of observing, that duplicating the binaural effect at home simply involves presenting at the entrance of the home ear canal an exact replica of what the same ear canal would have been presented with at the live music event. But to get to the entrance of the ear canal, almost all sound above about 2 kHz must first interact with the surface of the pinna. The pinna of your ear is in essence your own personal pseudo-random, multi-frequency, multi-directional encoder or acoustical notch filter. The pinna of my ear has a quite different (and undoubtedly superior) series of nulls and peaks than does yours. The sound that finally makes it to the ear canal, in the kilohertz region, is subject to severe attenuation or boost, depending on the angle from which the sound originates as well as on its exact frequency. Additionally, sounds that come from the remote side of the head are subject to additional delay and filtering by the head and torso, and this likewise very individual characteristic is called the Head-Related Transfer Function or HRTF. In this article, when I refer to the pinna, this should be understood to include the shadowing, reflection, and diffraction due to the head and torso, and the resonances in the pinna cavities, particularly the large bowl known as the concha. The effects of the head and torso become appreciable starting at frequencies around 500 Hz, with the pinna becoming active above 1500 Hz. Because the many peaks and nulls are very close together and sometimes very narrow, it is exceedingly difficult to make measurements using human subjects, and not every bit of fine structure can be captured, particularly at the higher frequencies where the interference pattern is very hard to resolve.
Because the peaks or nulls are so narrow, and because a null at one ear is likely to be something else at the other ear, we do not hear these dips as changes in timbre or as a loss or boost of treble response; but, as we shall see, the brain relies on these otherwise inaudible serrations to determine angular position with phenomenal accuracy.
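The angle-dependent notch filtering described above can be illustrated with a deliberately crude model: treat the pinna as a direct path plus a single reflection whose extra path length shrinks as the source moves to the side. The path lengths and geometry below are invented for illustration only (real pinnae produce many overlapping reflections), but the sketch captures the key effect: the cancellation notch moves in frequency as the angle of incidence changes.

```python
import math

def first_notch_hz(angle_deg):
    """First cancellation frequency of a toy one-reflection pinna model.

    The extra path length (an assumed 3 cm at 0 degrees, shrinking to
    1 cm at 90 degrees) is illustrative, not measured anatomy. Direct
    sound plus a delayed copy first cancels at f = 1 / (2 * tau).
    """
    c = 343.0                                    # speed of sound, m/s
    extra_path = 0.03 - 0.02 * angle_deg / 90.0  # reflection detour, metres
    tau = extra_path / c                         # reflection delay, seconds
    return 1.0 / (2.0 * tau)

# The notch climbs in frequency as the source swings toward the side:
for angle in (0, 45, 90):
    print(angle, round(first_notch_hz(angle)))
```

A brain that has learned which notch frequency corresponds to which angle can thus localize with one ear alone, which is the ability the next section demonstrates experimentally.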

Much research has been devoted to trying to find an average pinna response curve and an average HRTF that could be used to generate virtual-reality sound fields for military and commercial use in computer simulations, games, etc. So far no average pinna-HRTF emulation program has been found that satisfies more than a minority of listeners, and none of these efforts is up to audiophile standards. Remember that a solution to this problem must take into account the fact that each of us has a different pattern of sound transference around, over, and under the head, as well as differing pinnae.

The moral of all this is that if you are interested in exciting, realistic sound reproduction of concert hall music, it does not pay to try to fool your pinnae. If a sound source on a stage is in the center, then when that sound is recorded and reproduced at home it had better come from speakers that are reasonably straight ahead, and not from nearby walls or from multi-channel, surround, or Ambisonic speakers. The traditional equilateral stereophonic listening triangle is quite deficient in this regard. It causes ear-brain image-processing confusion for central sound sources because, although both ears get the same full-range signal telling the brain that the source is directly ahead, the pinnae are simultaneously reporting that there are higher-frequency sound sources at 30° to the left and at 30° to the right. All listeners will hear a center image under these conditions, which is why stereophonic reproduction has lasted 64 years so far, but almost no one would confuse this center image with the real thing. Unfortunately, a recorded discrete center channel and speaker is of little help in this regard. We will see later that such a solution has its own problems and is an unnecessary expense that does nothing for the existing unencoded two-channel recorded library.

Single-Pinna Phenomena

A very simple experiment demonstrates the ability of a single pinna to sense direction in the front horizontal plane at higher frequencies. Set up a metronome, or have someone tap a glass, run water, or shake a rattle, about ten feet directly in front of you. Close your eyes and locate the sound source using both ears. Now block one ear as completely as possible and estimate how far the apparent position of the sound has moved in the direction of the still-open ear. Most audio practitioners would expect that a sound heard in only the right ear would seem to come from the extreme right, but you will find that in this experiment the shift is only some ten or twenty degrees, and if you have great pinnae the source may not move at all. This is one case where the pinna directional detecting system is stronger than the intensity-stereo effect, and it explains why one-eared individuals can still detect sound-source position. It also explains why matrix or vector-addition methods, such as Ambisonics, which rely on addition or cancellation in the vicinity of the head at frequencies in excess of 1000 Hz, are just not good enough.

Martin D. Wilde, in his paper "Temporal Localization Cues and Their Role in Auditory Perception" (AES Preprint 3798, October 1993), states:

There has been much discussion in the literature whether human localization ability is primarily a monaural or binaural phenomena. But interaural differences cannot explain such things as effective monaural localization. However, the recognition and selection of unique monaural pinna delay encodings can account for such observed behaviour. This is not to say that localization is solely a monaural phenomena. It is probably more the case that the brain identifies and makes estimates of a sound's location for each ear's input alone and then combines the monaural results with some higher-order binaural processor.

Again, any reproduction system that does not take into account the sensitivity of the pinna to the direction of incidence will not sound natural or realistic. Two-eared localization is not superior to one-eared localization; they must both agree at all frequencies for realistic concert hall music reproduction.

Phantom Images at the Side

A phantom front center image can be generated by feeding identical in-phase signals to speakers at the front left and front right of a forward-facing listener. The surround-sound crowd would be ecstatic if they could produce as good a phantom image to the side in the same simple way, by feeding identical in-phase signals just to a right-front and right-rear speaker pair. Unfortunately, phantom images cannot be panned between side speakers the way they can between front speakers, without involving the other ear through speakers operating under Ambisonic or other interaural coding schemes, or by using dynamic, individualized pinna and head equalization. The reason realistic phantom side images are difficult to generate is that we are largely dealing with a one-eared hearing situation. Let us assume that for a right-side sound only negligible sound reaches the remote left ear. We already know that the only directional sensing mechanism a one-eared person has for higher-frequency sound is the pinna convolution mechanism. Thus if a sound comes from a speaker at 45 degrees to the front, the pinna will locate it there. If, at the same time, a similar sound is coming from 45 degrees to the rear, one either hears two discrete sound sources, or one speaker predominates and the image hops backward and forward between them. Of course, some sound does leak around the head to the other ear and, depending on room reflections, this affects every individual differently and unpredictably.
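The front phantom-image mechanism can be written down directly. The sketch below uses the standard constant-power sin/cos pan law (a textbook formula, not any particular product's implementation): at the center position both speakers receive identical in-phase signals, which is exactly the condition that produces the frontal phantom image.

```python
import numpy as np

def constant_power_pan(mono, position):
    """Pan a mono signal: position -1.0 = full left, +1.0 = full right.

    Standard sin/cos constant-power law; the total power L^2 + R^2
    is independent of position.
    """
    theta = (position + 1.0) * np.pi / 4.0   # maps -1..+1 onto 0..pi/2
    return np.cos(theta) * mono, np.sin(theta) * mono

# A 440 Hz tone panned dead center: both channels get identical,
# in-phase, equal-level feeds -- the phantom-center condition.
tone = np.sin(2 * np.pi * 440 * np.arange(4800) / 48_000)
left, right = constant_power_pan(tone, 0.0)
```

As the paragraph above explains, the identical trick applied to a front/rear pair on the same side does not fuse into a phantom image, because the single near pinna reports two distinct directions instead.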

The sensitivity of the ears, even when working independently, to the direction from which a sound originates mandates that, to achieve realistic Ambiophonic auralization, all signals in the listening room must originate from directions that will not confuse the ear-brain system. Thus if a concert hall has strong early reflections from 55 degrees (as the best halls should), then the home reproduction system should likewise launch such reflections from approximately this direction. In the same vein, much stage sound, particularly that of soloists, originates in the center twenty degrees or so more often than at the extremes. Thus it makes more sense to move the front-channel speakers to where the angle to the listening position is on the order of five to fifteen degrees instead of the usual thirty. This eliminates most of the pinna angular-position distortion but does limit the maximum perceived stage width to about 120°, which is double the normal stereo stage-image width. Remember that in an Ambiophonic sound field a slightly narrowed stage is simply equivalent to moving back a few rows in the auditorium, and it has not proven to be noticeable with most recordings. Similarly, simulated or recorded early reflections or reverberant tails from the sides or rear of a concert hall should either not come to the ears from the front main speakers at all or should be kept at as low a level as possible.
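The narrower speaker placement suggested above is plain trigonometry. The sketch below computes the speaker separation needed for a chosen total subtended angle; the 3 m listening distance is an arbitrary example, not a recommendation.

```python
import math

def speaker_separation(listening_distance_m, total_angle_deg):
    """Left-right speaker separation that subtends total_angle_deg
    at a listener sitting listening_distance_m in front of the
    midpoint between the speakers."""
    half_angle = math.radians(total_angle_deg / 2.0)
    return 2.0 * listening_distance_m * math.tan(half_angle)

# Conventional 60-degree stereo triangle versus a narrow 20-degree
# placement, both at a 3 m listening distance:
print(round(speaker_separation(3.0, 60.0), 2))  # about 3.46 m
print(round(speaker_separation(3.0, 20.0), 2))  # about 1.06 m
```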

Pinna Considerations in Binaural or Stereo Recording

The pinna must be taken into account when recordings are made, particularly recordings made with dummy heads. For example, if a dummy-head microphone has molded ear pinnae then such a recording will only sound exceptionally realistic if played back through earphones that fit inside the ear canal. Even then, since each listener's pinnae are different from the ones on the microphone, most listeners will not experience an optimum binaural effect. On the other hand, if the dummy head does not have pinnae, then the recording should either be played back Ambiophonically, using loudspeakers, or through earphones that stand out from the ears far enough to excite the normal pinna effect. (As in the IMAX system, loudspeakers can then be used to provide the lost bass.)

But one must take the head-related effects into account as well. Thus if one uses a dummy-head microphone without pinnae, then listening with loudspeakers or off-the-head earspeakers will produce image distortion, due to the doubled transmission around, over, and under both the microphone head and the listener's head. Even if we go back to a microphone with pinnae and use in-the-ear-canal phones, a particular listener's HRTF is not likely to match that of the dummy head. Until a personalized binaural system is created, binaural recordings for earphone-only listening are not likely to fulfill their promise. Again, one alternative, used in the Sony IMAX system, is to use off-the-ear earphones and loudspeakers simultaneously. The surround loudspeakers provide the personal pinna-response and HRTF cues for both front and rear sounds, while the earphones take care of the intra-aural part of the field or, perhaps more importantly, ensure that even listeners with theater seats off center still hear an image that matches the action. This method is great for applications where 360-degree direct-sound sources need to be reproduced, as in movies. But as we shall see, IMAX, like other costly surround-sound methods, is both unnecessary and even counterproductive when reproducing staged musical events.

Pinna Foolery or Feet of Klayman

Arnold Klayman (SRS, NuReality) has gamely tackled the essentially intractable problem of manipulating parts of a stereo signal to suit the angular sensitivity of the pinna, while still restricting himself to just two loudspeakers. To do this, he first attempts to extract those ambient signals in the recording that should reasonably be coming to the listening position from the sides or rear. There is really no hi-fi way to do this, but let us assume, for argument's sake, that the difference signal (L − R) is good enough for this purpose, particularly after some Klayman equalization, delay, and level manipulation. This extracted ambient information, usually mostly mono by now, must then be passed through a filter circuit that represents the side pinna response for an average ear. Since this pinna-corrected ambience signal is to be launched from the main front speakers along with the direct sound, in theory these modified ambience signals should be further corrected by subtracting the front pinna response from them. The fact that all this legerdemain produces an effect that many listeners find pleasing is an indication that the pinnae have been seriously impoverished by Blumlein stereo for far too long, and is a tribute to Klayman's extraordinary perseverance and ingenuity.
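For argument's sake, the difference-signal extraction step discussed above can be sketched as follows. The delay and gain values are illustrative placeholders, not Klayman's actual parameters, and the output is exactly as mono as the article warns.

```python
import numpy as np

def extract_ambience(left, right, delay_samples=960, gain=0.5):
    """Crude (L - R)/2 ambience extraction with a fixed delay and gain.

    Anything panned dead center cancels completely, so the output is
    dominated by out-of-phase recorded ambience. Note it is a single
    (mono) signal, whichever speakers it is later sent to.
    """
    side = 0.5 * (left - right)                       # difference signal
    delayed = np.zeros_like(side)
    delayed[delay_samples:] = side[:-delay_samples]   # simple delay line
    return gain * delayed

# A dead-center (identical L and R) source yields no "ambience" at all:
mono = np.random.default_rng(0).standard_normal(48_000)
print(np.max(np.abs(extract_ambience(mono, mono))))  # 0.0
```

The equalization that follows this extraction in a real processor would then impose an average side-pinna response, which is precisely the averaging the next paragraph criticizes.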

While Klayman's boxes cost relatively little and are definitely better than doing nothing at all about pinna distortion, any method that relies on an average pinna response or, like matrixed forms of surround sound, attempts to separate early reflections, reverberant fields, or extreme side signals from standard or matrixed stereo recordings of music is doomed to only minor success. The Klayman approach must also consider that an average HRTF is likewise required and should be used when moving side images to the front speakers. Someday we will all be able to get our own personal pinna and HRTF responses measured and stored on CD-ROM for use in Klayman-type synthesizers, but until then the bottom line, for audiophiles, is that the only way to minimize pinna- and head-induced image distortion is to give the pinnae what they are listening for. This means launching all signals, as much as is feasible, from the directions nature intended, and it requires that pure ambient signals such as early reflections and hall reverberation (uncontaminated with direct sound) come from additional speakers, appropriately located. It implies that recorded ambient signals inadvertently coming from the front channels have not been enhanced to the point where the anomaly of rear-hall reverb coming strongly from up front causes subconscious confusion. (Most CDs and LPs are fine in this regard but would be improved by a more Ambiophonic recording style.) It means that strong room reflections, which allow almost undelayed direct sound to hit the listener from the wrong angle or allow early reflections to come from the sides, the ceiling, the floor, or the rear wall, have been eliminated through inexpensive and simple room treatment and/or the use of focused (point-source or collimated) loudspeakers. Finally, it means moving the left and right main loudspeakers much closer together, as discussed above.

Two-Eared Pinnae Effects

So far we have been considering single-ear and head response effects. Now we want to examine the even more dramatic contribution of both pinnae and the head, jointly, to the interaural hearing mechanism that gives us such an accurate ability to sense horizontal angular position. William B. Snow, a one-time Bell Telephone Labs researcher, in 1953, and James Moir, in Audio magazine in 1952, reported that for impulsive clicks or speech and, by extension, music, differences in horizontal angular position as small as one degree could be perceived. For a source only one degree off dead ahead, we are talking about an arrival-time difference between the ears of only about ten microseconds, and an intensity difference just before reaching the ears so small as not to merit serious consideration. Moir went even further and showed that with the sound source indoors (even at a distance of 55 feet!), and using sounds limited to the frequency band above 3000 Hz, the angular localization got even better, approaching half a degree. It appears that when it comes to the localization of sounds like music, the ear is only slightly less sensitive than the eye in the front horizontal plane.
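The ten-microsecond figure can be checked with Woodworth's classic spherical-head formula, ITD = (a/c)(θ + sin θ). The 8.75 cm head radius below is the usual textbook value, not a measurement of any particular listener.

```python
import math

def itd_seconds(angle_deg, head_radius_m=0.0875):
    """Interaural time difference for a source angle_deg off the
    median plane, per Woodworth's spherical-head approximation."""
    c = 343.0                        # speed of sound, m/s
    theta = math.radians(angle_deg)
    return head_radius_m / c * (theta + math.sin(theta))

# A one-degree offset yields roughly nine microseconds of ITD:
print(round(itd_seconds(1.0) * 1e6, 1))
```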

It is not a coincidence that the ear is most accurate in sensing position in the high treble range, for this is the same region where we find the extreme gyrations in peaks and nulls due to pinna shape and head diffraction. This is also the frequency region where interaural intensity differences have long been claimed to govern binaural perception. However, it is not the simple amplitude difference in sound arriving at the outer ears that matters, but the difference in the sound at the entrance to the ear canal after pinna convolution (another favorite term of the auralization fraternity). Going even further, at frequencies in excess of 2000 Hz it is not the average intensity that matters but the differences in the pattern of nulls and peaks between the ears that allow the two-eared person to locate sounds better than the one-eared individual. Remember that at these higher audible frequencies, direct sounds bouncing off the various surfaces of the pinna add and subtract at the entrance to the ear canal. This random and almost unplottable concatenation of hills and deep valleys is further complicated by later but identical sound that arrives from hall (but hopefully not home) wall reflections, or from over, under, in front of, or behind the head. This pattern of peaks and nulls is radically different at each ear canal, and thus the difference signal between the ears is a very leveraged function of both frequency and source position. In their action, a pair of pinnae are exquisitely sensitive mechanical amplifiers that convert small changes in incident sound angle into dramatic changes in the fixed, unique, picket-fence patterns that each individual's brain has learned to associate with a particular direction.
Another way of describing this process is to say that the pinna converts small differences in the angle of sound incidence into large changes in the shape of complex waveforms, by inducing large shifts in the amplitude and even the polarity of the sinewave components of such waveforms. (Martin D. Wilde, cited above, also posits that the pinnae generate differential delays, amounting to reflections or echoes of the sound reaching the ear, and that the brain is adept at recognizing these echo patterns and using them to determine position. Since such temporal artifacts would be on the order of a few microseconds, it seems unlikely that the brain actually makes use of this time-delay data.)
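The interaural pattern difference described above can be caricatured with one reflection per ear. Both delays below are invented for illustration (the far ear's longer delay stands in for the extra path around the head), but even this toy model produces interaural level differences of tens of dB at specific frequencies.

```python
import numpy as np

def ear_response_db(freqs_hz, reflection_delay_s, reflection_gain=0.9):
    """Magnitude response (dB) of direct sound plus one delayed
    reflection: a toy stand-in for a pinna's interference pattern."""
    h = 1.0 + reflection_gain * np.exp(-2j * np.pi * freqs_hz * reflection_delay_s)
    return 20.0 * np.log10(np.abs(h))

freqs = np.linspace(2_000, 10_000, 8001)  # 1 Hz grid across 2-10 kHz
near = ear_response_db(freqs, 60e-6)      # assumed near-ear reflection delay
far = ear_response_db(freqs, 90e-6)       # assumed (longer) far-ear delay
interaural = near - far                   # interaural level difference, dB

# Where one ear has a null the other ear sits at an ordinary level,
# so the interaural difference swings through tens of dB:
print(round(float(interaural.max() - interaural.min()), 1), "dB swing")
```

A tiny change in either delay slides the null frequencies, flipping the sign of the interaural difference at many frequencies at once, which is the leverage the paragraph above describes.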

Depth and Angular Perception at Higher Frequencies

To put the astonishing sensitivity of the ear in perspective, a movement of one degree in the vicinity of the median plane (the vertical plane bisecting the nose) corresponds to a differential change in arrival time at the ears of only 8 microseconds. Eight microseconds is the period of a 125,000 Hz tone, or a phase shift of about 14° at 5 kHz. I think we can all agree that the ear-brain system could not possibly be responding to such differences directly. But when we are dealing with music that is rich in high-frequency components, a shift of only a few microseconds can cause a radical shift in the frequency location, depths, and heights of the myriad peaks and nulls generated by the pinnae in conjunction with the HRTF. To repeat, it is clear that very large amplitude changes, extending over a wide band of frequencies at each ear and between the ears, can and do occur for small source or head movements. It is these gross changes in the fine structure of the interference pattern that allow the ear to be so sensitive to source position.
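The equivalences just quoted are one line of arithmetic each:

```python
delta_t = 8e-6                # the 8-microsecond interaural difference

equivalent_freq = 1.0 / delta_t               # tone whose period is 8 us
phase_deg_at_5khz = 360.0 * 5_000 * delta_t   # phase this delay spans at 5 kHz

print(equivalent_freq)        # -> 125 kHz
print(phase_deg_at_5khz)      # -> about 14.4 degrees
```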

Thus, just considering frequencies below 10 kHz, at least one null of 30 dB is possible for most people, at even shallow source angles, for the ear facing the sound source. Peaks of as much as 10 dB are also common. The response of the ear on the far side of the head is more irregular, since it depends on head, nose, and torso shapes as well as on pinna convolution. One can easily see that a relatively minute shift in the position of a sound source could cause a null at one ear to become a peak while, at the same time, a peak at the other ear becomes a null, resulting in an interaural intensity shift of 40 dB! When we deal with broadband sounds such as musical transients, tens of peaks may become nulls at each ear and vice versa, resulting in a radical change in the response pattern, which the brain then interprets as position or realism rather than as timbre.

In setting up a stereo listening system, it is not possible to achieve a realistic concert hall sound field unless the cues provided by the pinnae at the higher frequencies match the cues being provided by the lower frequencies of the music. When the pinna cues don't match the interaural low-frequency amplitude and delay cues, the brain decides that the music is canned, or that the reproduction lacks depth, precision, presence, and palpability, or is vague, phasey, and diffuse. But even after ensuring that our pinnae are being properly serviced, other problems are inherent in the old stereo or new multi-channel surround-sound paradigms. We must still consider and eliminate the psychoacoustic confusion that always arises when there are two or more spaced loudspeakers delivering information about the same stage position but communicating with both pinnae and both ear canals. We must deal with non-pinna-induced comb-filter effects and the stage-width limitations still inherent in these modalities even after 64 years. But this is a subject for other web pages.