Realism in Music Reproduction

Ralph Glasgal
June 1999

Ever since 1881 when Clément Ader ran signals from ten (likely randomly) spaced pairs of telephone carbon microphones clustered on the stage of the Paris Opera via phone lines to single telephone receivers in the Palace of Industry that were listened to in pairs seemingly spaced more by accident than design, practitioners of the recording arts have been striving to reproduce a musical event taking place at one location and time at another location and time with as little loss in realism as possible. While judgments as to what sounds real and what doesn’t may vary from individual to individual, and there are many who religiously hold that realism is not the proper concern of audiophiles, such views of our hearing life should not be allowed to slow technical advances in the art of realistic auralization that listeners may embrace or disdain as they please. In this article we will review some past and recent developments in this field and indicate areas where more could readily be accomplished by recording engineers, manufacturers and audiophiles using existing tools.

What Is Realism In Sound Reproduction?

In this article, realism in staged music sound reproduction will usually be understood to mean the generation of a sound field realistic enough to satisfy any normal ear-brain system that it is in the same space as the performers, that this is a space that could physically exist, and that the sound sources in this space are as full bodied and as easy to locate as in real life. Realism does not necessarily equate to accuracy or perfection. Achieving realism does not mean that one must slavishly recreate the exact space of a particular recording site. For instance, a recording made in Avery Fisher Hall but reproduced as if it were in Carnegie Hall is still realistic, even if inaccurate. It is doubtful that any home reproduction system will be able to outperform a live concert in a hall the caliber of Boston’s Symphony Hall, but in many cases the home experience can now exceed a live event in acoustic quality. For example, a recording of an opera made in a smallish studio can easily be made to sound better at home, using the methods described below, than it did to most listeners at a crowded recording session. One can also argue that a home version of Symphony Hall, where one is apparently sitting tenth row center, is more involving than the live experience heard from a rear side seat in the balcony with an obstructed visual and sonic view.

In a similar vein, realism does not mean perfection. If a full symphony orchestra is recorded in Carnegie Hall but played back as if it were in Carnegie Recital Hall, one may have achieved realism but certainly not perfection. Likewise, as long as localization is as effortless and as precise as in real life, the reproduced locations of discrete sound sources usually don’t have to be exactly in the same positions as at the recording site to meet the standards of realism discussed here. (Virtual Reality applications, by contrast, often require extreme accuracy but realism is not a consideration.) An example of this occurs if a recording site has a stage width of 120° but is played back on a stage that seems only 90° wide. What this really means in the context of realism is that the listener has moved back in the reproduced auditorium some fifteen rows, but either stage perspective can be legitimately real. Finally, being able to localize a stage sound source in a stereo, surround sound or Ambisonics system does not guarantee that such localization will sound real. For example, a soloist reproduced entirely via one loudspeaker is easy to localize but almost never sounds real.

Reality Is In The Ear Of The Behearer

While it is always risky to make comparisons between hearing and seeing, I will live dangerously for the moment. If from birth, one were only allowed to view the world via a small black and white TV screen, one could still localize the position of objects on the video screen and could probably function quite well. But those of us with normal sight would know how drab, or I would say unrealistic, such a restricted view of the world actually was. If we now added color to our subject’s video screen, the still grossly handicapped (by our standards) viewer would marvel at the previously unimaginable improvement. If we now provided stereoscopic video, our now much less handicapped viewer would wonder how he had ever functioned in the past without depth perception or how he could have regarded the earlier flat monoscopic color images as being realistic. Finally, the day would come when we removed the small video screens and for the first time our optical guinea pig was able to enjoy peripheral vision and the full resolution, contrast and brightness that the human eye is capable of and fully appreciate the miracle of unrestricted vision. The moral of all this is that only when all the visual sense parameters are provided for, can one enjoy true visual reality. At the present time there is no visual recording or display system that any human being could mistake for the real thing, but the IMAX system is a tantalizing foretaste of what might soon be possible.

Since most of us are quite familiar with what live music in an auditorium sounds like, we can sense unreality in reproduction quite readily. But in the context of audio reproduction, the progression toward realism is similar to the visual progression above. To make reproduced music sound fully realistic, the ears, like the eyes, must be stimulated in all the ways that the ear-brain system expects. As in the visual example, when we go from mono to stereo to matrix surround to Ambisonics to multi-channel discrete, etc. (listed in order of increasing accuracy, assuming that a new multi-channel method will actually emerge that can outperform Ambisonics or, as discussed below, Ambiophonics), we marvel at each improvement, but since we already know what real concert halls sound like, we soon realize that something is missing. What is usually missing is completeness and sonic consistency. One can only achieve realism if all the ear’s expectations are simultaneously satisfied. If we assume that we know exactly how all the mechanisms of the ear work, then we could conceivably come up with a sound recording and reproduction system that would be quite realistic. But if we take the position that we don’t know all the ear’s characteristics, or that we don’t know how much they vary from one individual to another, or that we don’t know the relative importance of the hearing mechanisms we do know about, then the only thing we can do, until a greater understanding dawns, is what Manfred Schroeder suggested over a quarter of a century ago: deliver to the remote ears a realistic replica of what those same ears would have heard when and where the sound was originally generated.

Four Methods Used To Generate Reality At A Distance

Audio engineers have grappled with the problem of recreating sound fields since the time of Alexander Graham Bell. The classic Bell Labs theory suggests that a curtain, in front of a stage, with an infinite number of ordinary microphones driving a like curtain of remote loudspeakers can produce both an accurate and a realistic replica of a staged musical event and listeners could sit anywhere behind this curtain, move their heads and still hear a realistic sound field. Unfortunately, this method, even if it were economically feasible, fails on the first two counts with any finite number of speakers. Such a curtain can act like a lens and change the direction or focus of the sound waves that impinge on it. Like lightwaves, sound waves have a directional component that is easily lost in this arrangement either at the microphone, the speaker or both places. Thus each radiating loudspeaker in practice represents a new discrete sound source of uncontrolled directionality, communicating directly with both ears and therefore generating comb filter interference patterns and pinna directional distortion not present on the live side of the curtain.
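
The comb filtering mentioned above can be illustrated with a minimal sketch. It assumes an idealized case that is not taken from any measurement in this article: one ear receives the same signal from two loudspeakers, the second copy arriving later by `delay_s` and scaled by `gain`. The function names and the example values (0.25 ms, gain 0.8) are hypothetical.

```python
import cmath
import math


def comb_magnitude_db(freq_hz, delay_s, gain=1.0):
    """Level in dB of a direct sound plus one delayed copy, relative to the direct sound alone."""
    # Sum of the direct path (1) and the delayed path (gain * e^{-j 2 pi f tau}).
    h = 1.0 + gain * cmath.exp(-2j * math.pi * freq_hz * delay_s)
    return 20.0 * math.log10(max(abs(h), 1e-12))


def first_notch_hz(delay_s):
    """Lowest frequency at which the delayed copy arrives exactly out of phase."""
    return 1.0 / (2.0 * delay_s)


if __name__ == "__main__":
    delay = 0.00025  # assumed extra path delay of 0.25 ms
    print("first notch (Hz):", first_notch_hz(delay))
    for f in (500, 1000, 2000, 3000, 4000):
        print(f, "Hz:", round(comb_magnitude_db(f, delay, gain=0.8), 1), "dB")
```

Under these assumptions the response alternates between peaks and notches as frequency rises, which is the kind of coloration a single live source does not produce.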

Finally, this curtain of loudspeakers does not radiate into a concert-hall size listening room and so one would have, say, an opera house stage attached to a listening room not even large enough to hold the elephants in Act 2 of Aida. This lack of opera-house ambience wouldn’t by itself make this reproduction system sound unreal, even if the rest of the field were somehow made accurate, but it certainly wouldn’t sound perfect. The use of speaker arrays (walls of hundreds of speakers) surrounding a relatively large listening area has been shown to be able to synthesize any sound field in a room with remarkable accuracy. But while this technique may be useful in sound amplification systems in halls, theaters or labs, application to the playback of even multi-channel recordings in the home seems doubtful, except for the use of speaker arrays at the sides and rear or even overhead to deliver truly diffuse, reconstituted reverberant ambience to the home listener.

In general, multi-channel recording methods or matrix surround systems (Hafler, SQ, QS, UHJ, Dolby, 5.1, etc.) seem like exciting improvements when first heard by long deprived stereo music auditors, but in the end don’t sound real.

The Binaural Approach

A second, more practical and often exciting approach is the binaural one. The idea is that, since we only have two ears, if we record exactly what a listener would hear at the entrance to each ear canal at the recording site and deliver these two signals, intact, to the remote listener’s ear canals, then both accuracy and realism should be perfectly captured. This concept almost works and could conceivably be perfected, in the very near future, with the help of advanced computer programs, particularly for virtual reality applications involving headsets or near field speakers. The problem is that if a dummy head, complete with modeled ear pinnae and microphones embedded in the ear canals, is used to make the recording, then the listener must listen with in-the-ear-canal earphones because otherwise the listener’s own pinnae would also process the sound and spoil the illusion.

The real conundrum, however, is that the dummy head does not match closely enough any particular human listener’s head shape or external ear to avoid the internalization of the sound stage, whereby one seems to have a full symphony orchestra (and all of Carnegie Hall) from ear to ear and from nose to nape. Internalization is the inevitable and only logical conclusion a brain can come to when confronted with a sound field not at all processed by the head or pinnae. For how else could a sound have avoided these structures unless it originated inside the skull? If one uses a dummy head without pinnae, then, to avoid internalization, one needs earphones that stand off from the head, say, to the front. But now the direction of ambient sound is incorrect and side localization is not fully accurate. IMAX is an example of this off the ear method, as supplemented with loudspeakers. There is also a circumaural earphone design that places a tiny speaker just over the notch in the lower front part of the ear so that many of the pinna resonances are still normally excited for frontal sounds. A similar ear speaker over the upper rear part of the ear can provide a similar pinna-friendly input for rear originating sounds. Unfortunately, head shape differences between the dummy head and the listener’s head remain, and the dummy head should not have modeled pinnae if these earphones are to be used.

The fact that binaural sound via earphones runs into so many difficulties is a powerful indication that individual head shapes and outer ear convolutions are critically important to our ability to sense sonic reality.

Wavefield Synthesis

A third theoretical method of generating both an accurate and a realistic soundfield is to actually measure the intensity and the direction of motion of the rarefactions and compressions of all the impinging soundwaves at the single best listening position during a concert and then recreate this exact sound wave pattern at the home listening position upon playback. This method is the one expounded by the late Michael Gerzon starting in the early 1970s and embodied in the paradigm known as Ambisonics. In Ambisonics (ignoring height components), a coincident microphone assembly, which is equivalent to three microphones occupying the same point in space, captures the complete representation of the pressure and directionality of all the sound rays at a single point at the recording site. In reproduction, speakers surrounding the listener produce soundwaves that collectively converge at one point (the center of the listener’s head) to form the same rarefactions and compressions, including their directional components, that were recorded.

In theory, if the reconstructed soundwave is correct in all respects at the center of the head (with the listener’s head absent for the moment) then it will also be correct three and one half inches to the right or left of this point at the entrance to the ear canals with the head in place. The major advantage of this technique is that it can encompass front stage sounds, hall ambience and rear sounds equally, and that since it is recreating the original sound field (at least at this one point) it does not rely on the phantom image mechanism of Blumlein stereo. On the other hand, Ambisonic theory is silent on the subject of how the sounds coming from the various loudspeakers are modified by the ear pinna and the head shape and how a decoder might compensate for these effects.

Thus the Ambisonic method is not easy to keep accurate at frequencies much over 2000 Hz and must and does rely on the apparent ability of the brain to ignore this lack of realistic high frequency pinna, head and waveform localization input and localize on the basis of the easier to reconstitute lower frequency waveforms alone. This would be fine if localization, by itself, equated to realism or we were only concerned with movie surround sound applications.
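
One way to see why a frequency near 2000 Hz is a natural boundary is to compare the wavelength of sound with the ear spacing implied by the three and one half inch figure above. The following is only a rough sketch: the speed of sound (343 m/s, air at about 20 °C) is an assumed value, and the function names are illustrative.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C
INCH_M = 0.0254             # metres per inch


def wavelength_m(freq_hz):
    """Wavelength in metres of a sound wave at the given frequency."""
    return SPEED_OF_SOUND_M_S / freq_hz


def freq_for_wavelength_hz(length_m):
    """Frequency in Hz whose wavelength equals the given length."""
    return SPEED_OF_SOUND_M_S / length_m


# Ear canal entrances sit about 3.5 inches either side of the head center.
ear_spacing_m = 2 * 3.5 * INCH_M

if __name__ == "__main__":
    print("ear spacing (m):", round(ear_spacing_m, 4))
    print("frequency with that wavelength (Hz):",
          round(freq_for_wavelength_hz(ear_spacing_m)))
    for f in (200, 700, 2000, 8000):
        print(f, "Hz wavelength (m):", round(wavelength_m(f), 3))
```

With these numbers the wavelength shrinks to the ear spacing at about 1.9 kHz, so above that region the reconstructed field has to stay correct over a zone wider than one wavelength.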

Other problems with basic Ambisonics include the fact that it requires at least three recorded channels (if we are concerned about quality) and therefore can do little for the vast library of existing recordings. Back on the technical problem side, one needs to have enough speakers around the listener to provide sufficient diversity in sound direction vectors to fabricate the waveform with exactitude, and all these speaker positions, relative to the listener, must be precisely known to the Ambisonic decoder. Likewise, the frequency, delay and directional responses of all the speakers must be known or closely controlled for best results and, as in all other loudspeaker systems, the effects of listening room reflections must also be taken into account, or better yet, eliminated.

As you might imagine, it is quite difficult, particularly as the frequency goes up, to ensure that the size of the reconstructed field at the listening position is large enough to accommodate the head, all the normal motions of the head, the everyday errors in the listener’s position, and more than one listener. Those readers who have tried to use the Lexicon panorama mode, the Carver sonic hologram or the Polk SDA speaker system, all designed to correct the higher frequency parts of a simple stereo soundfield at the listener’s ear by acoustic cancellation, will appreciate how difficult this sort of thing is to do in practice, even when only two speakers are involved.

In my opinion, however, the basic barrier to reality, via any single point waveform reconstruction method, like Ambisonics, is its present inability, as in the binaural case, to accommodate to the effects of the outer ear and the head itself on the shape of the waveform actually reaching the ear canal. For instance, if a wideband soundwave from a left front speaker is supposed to combine with a soundwave from a rear right speaker and a rear center speaker etc. then for those frequencies over say 2500 Hz the left ear pinna will modify the sound from each such speaker quite differently than expected by the equations of the decoder, with the result that the waveform will be altered in a way that is quite individual and essentially impossible for any practical decoder to control. The result is good low frequency localization but poor or non-existent pinna localization. Unfortunately, as documented below, mere localization, lacking consistency, as is unfortunately the case in stereo, surround sound or Ambisonics is no guarantor of realism. Indeed, if we must sacrifice a localization mechanism, let it be the lowest frequency one.

Finally, one can make a case that one can have glorious realism, even without any detailed front stage localization, as long as ambient localization is directionally correct (as anyone who has sat in the last row of the family circle in Carnegie Hall can attest to).

Ambiophonics

The fourth approach, that I am aware of, I have called Ambiophonics. Ambiophonics, which borrows a little from Binaural and still less from Ambisonics, assumes that there are more localization mechanisms than are dreamed of in the previous philosophies and strives to satisfy all of the mechanisms, as far as is possible. It also takes the psychoacoustic position that absolute binaural positional accuracy, as opposed to absolute realism, is not as vital and furthermore, that this reproduction technology need only be concerned with reproducing staged acoustical musical events, not movies or virtual reality. The advantage of focusing on just one aspect of sonic reality is that this reality is achievable today, is reasonable in cost, and is applicable to existing LPs and CDs.

One basic element in Ambiophonic theory is that it is best not to record rear and side concert-hall ambience or try to extract it later from a difference signal or recreate it via waveform reconstruction, but to synthesize the ambient part of the field using real stored concert hall data to generate ambience signals using the new generation of digital signal processors. The variety and accuracy of such synthesized ambient fields is limited only by the skill of programmers and data gatherers, and the speed and size of the computers used. Thus, in time, any wanted degree of concert hall design perfection could be achieved. A library of the world’s great halls may be used to fabricate the ambient field, as has already been done with startling success in the JVC XP-A1010. The number of speakers needed for ambience generation does not need to exceed six or eight (although speaker walls would be optimum) and is comparable to Ambisonics or surround sound in this regard, but even more speakers could be used, as this synthesis method is completely scalable and the quality and location of these speakers is not critical.
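
The idea of generating ambience from stored hall data can be sketched as a convolution of the recorded signal with an impulse response. The sketch below is a toy illustration, not a description of the JVC unit or of any real hall: the reflection list is hypothetical, delays are counted in samples, and the function names are invented for this example. The impulse response has no tap at zero delay, so the output contains only ambience and not the direct sound.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, x in enumerate(signal):
        if x == 0.0:
            continue  # skip silent samples
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


def sparse_impulse_response(reflections, length):
    """Build an impulse response from (delay_in_samples, gain) pairs."""
    ir = [0.0] * length
    for delay, gain in reflections:
        ir[delay] += gain
    return ir


# Hypothetical stored data for one ambience speaker: three reflections.
reflections = [(3, 0.5), (7, -0.3), (12, 0.2)]
ir = sparse_impulse_response(reflections, 16)

dry = [1.0, 0.0, 0.0, 0.5]     # a short stretch of the recorded signal
ambience = convolve(dry, ir)   # feed for that ambience speaker

if __name__ == "__main__":
    print(ambience)
```

In a practical system each ambience speaker would presumably get its own, much longer impulse response, and a fast convolution method would replace the double loop.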

Ambiophonics is usually less limited as to the number of listeners who can share the experience at the same time compared to most implementations of other methods using a similar number of speakers. Fortunately, two to five people can be accommodated by Ambiophonics in several of its practical incarnations.

The other basic tenet of Ambiophonics is similar to Ambisonics and that is to recreate at the listening position an exact replica of the original pressure soundwave. However, Ambiophonics does this by transporting the sound source, stage, and hall to the listening room rather than a point wavefront to the ears. In other words, Ambiophonics externalizes the binaural effect, using, as in the binaural case, just two recorded channels but with two front stage reproducing loudspeakers and eight or so ambience loudspeakers in place of earphones. Ambiophonics generates stage image widths up to 120° with an accuracy and realism that far exceeds that of any other 2 channel reproducing scheme. While it hardly seems to be necessary, the use of four channels and four main front loudspeakers can produce a full 180° stage image, (see below) but I doubt the expense would be worth it since I for one have never attended a live performance, and had a seat, where the music came from anything approaching a full half circle.

For reasons outlined below, Ambiophonic reproduction does require that the two main front speakers subtend an angle of only about 10° each side of the listening position so as not to generate the kind of pinna angle distortion for central sounds that phantom-image-stereo, Ambisonic or surround sound speaker placement almost always gives rise to. Ambiophonics also requires that a small lightweight, sound absorbing panel be placed on edge, centered in front of the listening position so as to prevent the left front speaker signal from reaching the right ear and vice versa. While there are electronic means to accomplish this end (Carver, Lexicon, Cooper-Bauck-Harman Intl.) and extra speaker means (Polk or easily home made), none of these work as well as a small inexpensive panel. The Ambiophonic listener is free to rotate his head, rock back and forth, and undulate from side to side, without image shift, just as in a concert hall. Most audio enthusiasts imagine that the use of the panel will be objectionable on aesthetic grounds. I certainly wish I could think of a less problematical way to accomplish the same end, but, at least in practice, one gets used to the panel very quickly and soon wonders why anyone listens without one. The new lightweight materials make it easy to store the panel between sessions or provide extras for multiple listeners.

The Stereo Dipole, AES Preprint 4463

For those unalterably opposed to using a panel, Ole Kirkeby and Philip A. Nelson of The University of Southampton, with Hareo Hamada of Tokyo Denki University, have developed an electronic version of the panel. They have shown that the ideal speaker spacing for a crosstalk cancellation system, be it mechanical or electronic, is about 10 degrees. They refer to two speakers placed so close together as a “stereo dipole”. The electronic filters required to cancel crosstalk in this arrangement are somewhat easier to design and are more effective, since at the narrower angle there is little diffraction around the head for the correction signals and so HRTF correction is not critical. Pinna angle distortion of the correction signals is also not a major factor, and so the crosstalk cancellation can be allowed to operate over the full upper frequency range without restricting the size of the listening area or generating the audible phasiness effects that afflict electronic crosstalk cancellation schemes for widely spaced loudspeakers. A simple, low cost, lightweight panel will still remain the best choice for critical listeners.
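
The principle of electronic crosstalk cancellation can be sketched with a textbook two-by-two matrix inversion at a single frequency. This is not the filter design of the preprint: it assumes a free-field model in which each speaker reaches the near ear unchanged and the far ear with relative gain `gain` after an extra delay, and it ignores head diffraction entirely. The ear spacing, the speed of sound, the gain of 0.9 and all function names are assumptions made for the example.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed
EAR_SPACING_M = 0.178       # assumed, roughly seven inches


def extra_path_delay_s(speaker_angle_deg):
    """Rough extra travel time to the far ear for a speaker at this angle off center."""
    return EAR_SPACING_M * math.sin(math.radians(speaker_angle_deg)) / SPEED_OF_SOUND_M_S


def crosstalk_term(freq_hz, gain, delay_s):
    """Complex far-ear response relative to the near ear at one frequency."""
    return gain * cmath.exp(-2j * math.pi * freq_hz * delay_s)


def acoustic_matrix(freq_hz, gain, delay_s):
    """Speaker-to-ear transfer matrix: rows are ears, columns are speakers."""
    x = crosstalk_term(freq_hz, gain, delay_s)
    return [[1.0, x], [x, 1.0]]


def canceller_matrix(freq_hz, gain, delay_s):
    """Inverse of the acoustic matrix, applied to the two channels before the speakers."""
    x = crosstalk_term(freq_hz, gain, delay_s)
    det = 1.0 - x * x  # cannot vanish while gain < 1
    return [[1.0 / det, -x / det], [-x / det, 1.0 / det]]


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


delay = extra_path_delay_s(10.0)  # speakers about 10 degrees each side
product = matmul2(acoustic_matrix(1000.0, 0.9, delay),
                  canceller_matrix(1000.0, 0.9, delay))

if __name__ == "__main__":
    print("extra delay (microseconds):", round(delay * 1e6, 1))
    print("acoustics times canceller:", product)
```

If the model held exactly, the product would be the identity matrix, meaning each ear receives only its own channel. Real heads and rooms depart from the model, which is one reason the physical panel remains attractive.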

Since Ambiophonics is a binaural based system, it does not provide the Blumlein loudspeaker crosstalk signal that furnishes the lowest frequency phase shift localization cues for recordings made with a coincident microphone. But to counterbalance this, Ambiophonics, or any crosstalk elimination idea, is more compatible than is standard stereo with the overwhelming majority of non-coincident microphone recording arrangements and the improvement in HF localization more than compensates for any loss in coincident mic LF localization. Furthermore, depending on its size and absorbency, the barrier (and even its electronic cousins) loses its effectiveness at low frequencies thus allowing some crosstalk and therefore amplifying LF phase cues for coincident microphone recordings. One can also move a little further back from the edge of the barrier or use a smaller panel when listening to coincident mic recordings.

As in all realistic systems, room treatment is essential for a good result, and I have found that reducing the room reverberation time to less than 0.2 seconds works well in this context, especially if used in combination with very directional, diffraction-free, point-source, front channel loudspeakers as once recommended by Malcolm Hawksford in another context.
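
To get a feel for how much treatment a 0.2 second target implies, one can use Sabine's classic reverberation formula, RT60 = 0.161 V / A, with V the room volume in cubic metres and A the total absorption in square metres. This is only a rough guide, since Sabine's formula becomes less accurate in heavily damped rooms. The room dimensions below are hypothetical and the function names are illustrative.

```python
SABINE_CONSTANT = 0.161  # seconds per metre, metric form of Sabine's formula


def sabine_rt60_s(volume_m3, absorption_m2):
    """Estimated reverberation time for a room volume and total absorption."""
    return SABINE_CONSTANT * volume_m3 / absorption_m2


def required_absorption_m2(volume_m3, target_rt60_s):
    """Total absorption needed to reach the target reverberation time."""
    return SABINE_CONSTANT * volume_m3 / target_rt60_s


# Hypothetical listening room: 5 m x 4 m x 2.5 m.
room_volume_m3 = 5.0 * 4.0 * 2.5
room_surface_m2 = 2 * (5.0 * 4.0 + 5.0 * 2.5 + 4.0 * 2.5)

needed = required_absorption_m2(room_volume_m3, 0.2)

if __name__ == "__main__":
    print("absorption needed (m^2):", round(needed, 2))
    print("average absorption coefficient:", round(needed / room_surface_m2, 2))
```

For this example room the estimate calls for an average absorption coefficient of roughly 0.47 over all surfaces, which suggests that a 0.2 second room needs substantial treatment.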

Other Contrasts Between Ambiophonics and Ambisonics

The really fundamental difference between Ambisonics and Ambiophonics is that Ambisonics attempts to fabricate the exact compressions and rarefactions, including their intensities and directions at each ear canal, by summing the outputs of a given array of sound emitters whose drive signals must be derived by computation from three signals: the f (front) and s (side) directional velocity microphone signals and the o omnidirectional pressure signal. (Since most readers will not be familiar with the mathematical symbols used in Ambisonics, I will use o instead of w for the omnidirectional signal, f for the front-rear x signal and s for the left-right side y signal. We ignore the ‘h’eight (z) axis signal here.) In theory, there could be a playback computer that was fast enough to process such a three channel mic input with the accuracy needed to produce a perfect spherical wave front of, say, the fifth degree and the fourth order up to 15 kHz. Each user would also have to load his personal pinna response curve into this computer to get the correct waveform at the entrance to the ear canal. Each speaker signal would then be convolved with the appropriate direction-dependent pinna function. You would also have to place six or more speakers accurately, enter their delay and polar responses into the computer, and do something about room reflections. The recording medium would still require three discrete channels and so this powerful Ambisonic computer would not do much for non-Ambisonic recordings.
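
A minimal sketch of how three such signals relate to speaker feeds, using the o, f and s names above. It makes several assumptions that are not taken from the article: a single source in the horizontal plane, azimuth measured in degrees from straight ahead and positive toward the left, unnormalized signals (real Ambisonic formats scale the omnidirectional channel differently), and a regular ring of speakers with a basic decode that matches pressure and velocity at the center. The pinna, delay and room corrections discussed above are not modeled, and the function names are invented.

```python
import math


def encode(sample, azimuth_deg):
    """Return (o, f, s) for one sample of a source at the given azimuth."""
    a = math.radians(azimuth_deg)
    o = sample                # omnidirectional pressure
    f = sample * math.cos(a)  # front-rear velocity component
    s = sample * math.sin(a)  # left-right velocity component
    return o, f, s


def decode(o, f, s, speaker_azimuths_deg):
    """Basic decode to a regular ring of at least three speakers."""
    n = len(speaker_azimuths_deg)
    feeds = []
    for az in speaker_azimuths_deg:
        a = math.radians(az)
        feeds.append((o + 2.0 * (f * math.cos(a) + s * math.sin(a))) / n)
    return feeds


speakers = [0, 60, 120, 180, 240, 300]  # hypothetical hexagon layout
o, f, s = encode(1.0, 60.0)             # source 60 degrees to the left
feeds = decode(o, f, s, speakers)

if __name__ == "__main__":
    for az, feed in zip(speakers, feeds):
        print(az, "degrees:", round(feed, 3))
```

In this sketch the speaker nearest the source gets the largest feed, the feeds sum back to the pressure signal, and the speakers opposite the source receive out-of-phase signals. Every speaker contributes to the sum at the center, which is why speaker placement and response matter so much in this approach.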

So far, by restricting itself to relatively low frequency waveform synthesis, Ambisonics has been able to function at reasonable cost, please its adherents and seem a promising candidate as a standard for 360° surround sound for video or virtual reality applications that require height. In contrast, Ambiophonics does not try to create the exact sound field at the recording site, only one that could exist, that can be reproduced without generating localization contradictions, and one that can be accepted by the brain as real. The stage image heard or the hall ambience that the Ambiophonic computer generates may not be exactly Carnegie Hall or may not always be as the recording engineer remembers, but this system is doable now and works well with most existing recordings, even mono ones. This is not to say that the design of a stored ambience convolution computer is a trivial project, or that it will ever be a very low cost device, but using stored descriptions of existing halls makes the job a lot easier than starting a synthesis program from scratch and the concert hall auralization tools, already used by architects today, could be applied at once to a consumer product.

Also, in contrast to Ambisonics, Ambiophonics does not require a known precise placement of ambience speakers and their polar radiation characteristics are not of critical importance. Remember that small changes in any ambient field are equivalent to small changes in the hall volume, shape, or finishes, or shifts in one’s seat, or in the number of people in the audience.

Psychoacoustic Fundamentals Related to Realism in Reproduced Sound

Our problem is how to achieve realistic sound with the psychoacoustic knowledge at hand or suspected. For starters, the fact that separated front loudspeakers can produce centrally located phantom images between themselves is a psychoacoustic fluke that has no purpose or counterpart in nature and is a poor substitute for natural frontal localization. Any reproduction methods that rely on stimulating phantom images, and this includes not only stereo but most versions of surround sound, can never achieve realism even if they achieve localization. Realism also cannot be obtained merely by adding surround ambience to phantom localization. Ambisonics, Binaural, and Ambiophonics do not employ the phantom image mechanism to provide the front stage localization and therefore, in theory, should all sound more realistic than stereo and, in fact, do.

Ambiophonic microphone arrangements could make this approach to realism even more effective, but I am happy to report that Ambiophonics works quite well with most of the microphone setups used in classical music or audiophile caliber jazz recordings. Adding home generated ambience provides the peripheral sound vision to perfect the experience.

Since our method is to just give the ears everything they need to get real, it is not essential to prove that the pinnae (and I usually mean this word to also include the concha, the head and the torso) are more important than some other part of the hearing mechanism, but the plain fact is that they are. To me it seems inconceivable that anyone could assume that the pinnae are vestigial or less sensitive in their own frequency domain than the other ear structures are in theirs. As a hunter-gatherer animal, it would be of the utmost importance to sense the direction of a breaking twig, a snake’s hiss, an elephant’s trumpet, a bird’s call, the rustle of game, etc. and probably of less importance to sense the lower frequency direction of thunder, the sigh of the wind, or the direction of drums. The size of the human head clearly shows the bias of nature in having humans extra sensitive to sounds over 700 Hz.
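
One conventional way to connect head size with a figure near 700 Hz is Woodworth's spherical-head estimate of the interaural time difference, ITD = (r / c)(θ + sin θ). Once half a period of a tone becomes shorter than the largest ITD, the phase difference between the ears is ambiguous and other cues must take over. The sketch below is an illustration under assumed values: a rigid spherical head of radius 3.5 inches, a speed of sound of 343 m/s, and hypothetical function names.

```python
import math

SPEED_OF_SOUND_M_S = 343.0      # assumed
HEAD_RADIUS_M = 3.5 * 0.0254    # assumed spherical head, 3.5 inch radius


def woodworth_itd_s(azimuth_deg, head_radius_m=HEAD_RADIUS_M):
    """Interaural time difference for a distant source, 0 to 90 degrees off center."""
    a = math.radians(azimuth_deg)
    return head_radius_m / SPEED_OF_SOUND_M_S * (a + math.sin(a))


def phase_ambiguity_hz(itd_s):
    """Frequency at which half a period equals the given time difference."""
    return 1.0 / (2.0 * itd_s)


max_itd_s = woodworth_itd_s(90.0)  # source directly to one side
limit_hz = phase_ambiguity_hz(max_itd_s)

if __name__ == "__main__":
    print("largest ITD (microseconds):", round(max_itd_s * 1e6))
    print("phase cue becomes ambiguous above about (Hz):", round(limit_hz))
```

Under these assumptions the limit falls between 700 and 800 Hz, in the same region as the figure quoted above.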

Look at your ears. The extreme non-linear complexity of the outer ear structures and their small dimensions defy mathematical definition and clearly imply that their exact function is too complex and too individual to understand, much less fool, except in half-baked ways. The convolutions and cavities of the ear are so many and so varied as to make sure that their high frequency response is as jagged as possible and as distinctive a function of the direction of sound incidence as possible. The idea is that no matter what high frequencies a sound consists of or from what direction a transient sound comes, the pinnae and head together, or even a single pinna alone, will produce a distinctive pattern that the brain can learn to recognize in order to say this sound comes from over there.

The outer ear is essentially a mechanical converter that maps discrete received sound directions to preassigned frequency response patterns. There is also no purpose in having the ability to hear frequencies over 10 kHz, say, if they cannot aid in localization. The dimensions of the pinna structures and the measurements by Møller strongly suggest, if not yet prove, that the pinnae do function for this purpose even in the highest octave. Møller’s curves of the pinna and head functions with frequency and direction are so complex that the patterns are largely unresolvable and very difficult to measure using live subjects. Again, it doesn’t matter whether we know exactly how anyone’s ears work as long as we don’t compromise on bandwidth, frequency response, loudness, distortion, and especially source directionality, at all frequencies, during reproduction.

The Evidence For Pinna Localization Priority

The above doesn’t mean that we have to ignore all the research that has preceded us. The literature overwhelmingly supports the view that localization for broadband sounds at frequencies over approximately 1.5 kHz is based on single pinna, dual pinnae and the HRTF (Head Related Transfer Function) and is stronger (more accurate is a better word) than the localization ability of the ear at frequencies below, say, 600 Hz. (In this and my other papers on this subject, I try to use the term HRTF to refer only to head and torso effects that modify sounds before they reach the outer ears.) I believe the references cited below even support the notion that localization accuracy is directly proportional to the frequency of complex music-like sounds, which goes a long way toward explaining why transient localization is so strong. It also explains the Franssen Effect, where sound is localized to the source sounding the transient part of a complex signal that has been broken up into two parts, one the transient and the other the continuing lower frequency sinusoid. See Blauert, Spatial Hearing.

William B. Snow, in Basic Principles of Stereophonic Sound, 1953, as reprinted in Stereophonic Techniques, states “for impulsive sounds such as speech or clicks, differences as small as 1° or 2° can be perceived.” He goes on: “The intensity differences (at the ears) due to diffraction are functions of frequency and cause a complex sound to have a different frequency-intensity composition or quality at each ear. It is undoubtedly this effect which removes ambiguities in direction because the diffraction effects are so complicated that a given quality difference can correspond only to one direction.” Unfortunately, Snow never uses the word pinna, but he does say “however, in the higher frequency region, intensity differences produced by the diffraction or sound shadow effects of the head and external ears become great enough to give angular localization.”

But, you say, Snow does not say one mechanism is stronger than another, although his use of the word clicks strongly implies this. Fair enough. An earlier bit of research in England by James Moir, published in Audio Magazine in Oct. 1952 and reprinted in Stereophonic Techniques, is even more explicit on this point. (See the complete bibliography now on my web site.) In his Table Two he reports on the accuracy of localization as a function of the frequency band of filtered male speech used as a test signal. For a frequency band of 50 to 500 Hz the average localization error was 3.8°, for 500 to 3000 Hz the average error was 0.9°, and for 3000 to 7000 Hz (a rather restricted bandwidth) the average localization error was an astonishingly low 0.5°. Furthermore, although Moir did not comment on this phenomenon, his last table entry, for 50 to 7000 Hz wideband speech, shows a slightly greater error of 0.7°. One could infer from this result that, in the presence of sufficient high frequency localization cues, the lower frequencies just get in the way.

Don Keele Jr., in AES preprint 2420, Nov. 1986, says “We used wide-band pink noise for the input signal in all carrier tests. An interesting phenomena that we observed, was the breakup of the sound image. Changes in amplitude and delay are effective in shifting the image only at certain frequencies: Up to 700Hz, for delay and greater than 2000Hz, for amplitude with the region between 700Hz and 2000Hz effective for both in combination. At times we would perceive the low frequencies staying at the origin and the high frequencies shifting or vice-versa. The soundfield (with barrier) extends much beyond the typical stereo arrangement of 30° to the left and right, however the goal of a 180° soundfield was not met. The amplitude panned data show that the image shifted in direct proportion to the amplitude differential, out to roughly 50° or 60° off axis. The delay panned data is similar in that an image shift limit is found to occur at roughly the same angles. This image shift limit noted in both amplitude and delay panned data could be due to two possible reasons: Imperfect blocking of the crosstalk signal by the barrier and the effect of the ear’s pinna on the frequency response of the received acoustic signal… In the second case, the barrier-speaker setup generates acoustic signals that always reach the listener coming from directly ahead. The ears are not receiving the correct frequency response cues due to pinna effects, etc. that signals coming from large off-axis angles would have. This means that additional processing to include these effects may be necessary to swing the signals further around to the side.”

Don’t Tolerate Pinna Privation!

In my own experience, the pinnae clearly outvote the lower frequency localization senses. Try the experiment outlined at the Ambiophonic web site to test your own pinna power. The internalization of the binaural sound field reproduced with earphones is another good example of the pinnae riding roughshod over the interaural delay cues. It is true that the binaural image does spread from ear to ear, but is this accuracy realistic? Blauert, in Spatial Hearing, on page 49, confirms the everyday observation that excellent broadband localization is possible even for people totally deaf in one ear. One-eared hearing cannot possibly use interaural phase mechanisms or interaural intensity cues. Thus there exists a strong non-interaural localization mechanism that cannot simply be ignored. The ability of pinna equalizing boxes such as the Klayman NuReality device to move images freely, despite the presence of unaltered low frequency cues, is also indicative of pinna power. Why is this relevant to surround sound for music or to Ambisonics? Well, if the pinnae are as important as the literature and I suggest, then the reconstruction of the Ambisonic plane wave must be accurate to beyond 10 kHz, and this is probably not achievable. Let me quote from Vanderkooy and Bamford, Ambisonic Sound For Us, AES Preprint 4138, Oct. 1995: “the benefits of the Ambisonic system rapidly decline with increasing frequency.” Or, from Vanderkooy and Lipshitz, AES preprint 2554, Oct. 1987: “We show that it is only in the low frequency regime, below maybe 700Hz that the spatial region within which an Ambisonic system will reasonably-well reconstruct the traveling wave which would correspond to a real acoustic source, is large enough to encompass the head of a central listener.” Similar considerations will likely apply to most of the proposed multi-channel recording standards now being considered, as far as the realistic reproduction of acoustical musical events is concerned.
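To get a feel for the scale involved in the quotations above, it helps to compare wavelengths with the size of a head. The following is only a rough sketch in Python. The speed of sound (343 m/s) and the nominal head diameter (0.17 m) are illustrative assumptions, not figures taken from the preprints, and the comparison does not reproduce the analysis in those papers.

```python
# Rough scale comparison: acoustic wavelength versus head size.
# Both constants below are illustrative assumptions.
SPEED_OF_SOUND_M_S = 343.0   # assumed: air at about 20 degrees C
HEAD_DIAMETER_M = 0.17       # assumed: nominal adult head width


def wavelength_m(freq_hz):
    """Wavelength in metres of a sound wave at freq_hz."""
    return SPEED_OF_SOUND_M_S / freq_hz


if __name__ == "__main__":
    for freq in (700, 2000, 10000):
        lam = wavelength_m(freq)
        ratio = lam / HEAD_DIAMETER_M
        print(f"{freq:>6} Hz: wavelength {lam * 100:.1f} cm "
              f"({ratio:.2f} head diameters)")
```

At 700 Hz the wavelength is about half a metre, several head widths, while at 10 kHz it is a few centimetres, far smaller than the head. Under these assumptions, a reconstructed wave that must hold across the whole head becomes a much more demanding target as frequency rises.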

Ambiophonic Recording for Realism

One can heighten the accuracy, if not gild the lily of realism, of an Ambiophonic reproduction system by taking advantage in the microphone arrangement of the knowledge that in playback, the rear half and side hall ambience will be synthesized, that there is no crosstalk, that listening room reflections are minimized and that the front loudspeakers are relatively close together. For political reasons and as an educational exercise, we can use Ambisonic nomenclature to describe such an arrangement. The sound waves at a given point in space can be completely captured by placing three imaginary microphones simultaneously at that precise spot, again ignoring height. One of these microphones is a pressure microphone whose omnidirectional output (o) is simply proportional to the instantaneous value of all the compressions and rarefactions at that point and moment adding and subtracting. A single unobstructed o microphone signal inherently contains no directional information (although it could if it were baffled). The second microphone is a figure eight (velocity) microphone pointing straight ahead and straight behind. The output of this microphone (f) is amplitude sensitive to the direction from which the soundwave comes and declines to zero in cosine fashion as the sound moves from directly in front to directly at the side. In other words a velocity microphone is a direction-to-amplitude encoder. Such a microphone is similarly sensitive to sounds coming from the rear, but the polarity of such signals is inverted. The third microphone (s) is a second identical figure eight microphone so oriented that it is most sensitive to sounds from the left or right and has zero output for signals dead ahead or dead behind. We will ignore, for the moment, the frequency response and other aberrations of real microphones and the difficulty of actually making three or even two microphones truly coincident by mechanical means alone.
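The three idealized pickup patterns just described can be sketched in a few lines of Python. This is a minimal illustration, assuming a single plane wave in the horizontal plane, unit gains on all three microphones, and a sign convention in which positive azimuth is toward the left. The function name `encode_ofs` and that convention are hypothetical choices made here for illustration; practical Ambisonic formats apply their own channel scaling, which is ignored.

```python
import math


def encode_ofs(pressure, azimuth_deg):
    """Idealized (o, f, s) outputs for one plane wave.

    pressure    -- instantaneous sound pressure at the microphone point
    azimuth_deg -- arrival direction; 0 = dead ahead, 90 = left (assumed),
                   180 = dead behind
    """
    theta = math.radians(azimuth_deg)
    o = pressure                     # omni: no directional information
    f = pressure * math.cos(theta)   # front/back figure eight
    s = pressure * math.sin(theta)   # left/right figure eight
    return o, f, s


if __name__ == "__main__":
    for az in (0, 45, 90, 180):
        o, f, s = encode_ofs(1.0, az)
        print(f"azimuth {az:>3} deg: o={o:+.3f} f={f:+.3f} s={s:+.3f}")
```

The f output falls to zero in cosine fashion as the source moves to the side and changes polarity for sources behind, while the s output is zero for sources dead ahead or dead behind, as in the description above.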

It would be very simple, in theory, to combine Binaural and Ambisonic methods to produce Ambinaural by using two head-spaced o,f,s microphones, six recording channels, and one Ambisonic speaker decoder for each ear. Ideally, one would want the playback decoder for each ear to completely determine the sound at that ear. In reality it is almost impossible to prevent one set of speakers from also communicating with the wrong ear, but one could use the Ambiophonic panel to isolate the two sets of speakers. Ambiosonics? The decoders in this six channel system would only have to generate a frontal wavefield at one ear from signals in one 90° quadrant, perhaps using three speakers on each frontal side, which is much easier to do than for the general case of an entire circle, since the rear half of the ambient field is synthesized in our case. Since all of the decoded signals would be coming from more or less the proper directions, the pinna angle distortion problem would not be serious. This is the brute force, cost-no-object method, and indeed the recent ARA proposal for DVD audio suggests a six channel recording format, but its authors don’t have anything like Binaurosonics in mind, yet.

Reductio Ad Ambiosurdum

It is very difficult to mount three microphones so that they are really coincident, and although there are electronic position shifters that correct for this, we are still left with six outputs when we want Ambiophonics to work with two. The trick is to see how the Ambisonic microphone arrangement can be simplified without sacrificing anything vital to realistic music reproduction. The first thing we should remember is that we do not need to pick up any sound from the rear half circle, since that ambience is computer generated. Thus if our microphone is baffled, we can ignore the rear half lobe of the f microphones and the rear half circle of the o and s signals. For angles close to the middle, the f and o signals would be very much the same, so let us assume that we delete the two f signals altogether. We also know from our earlier discussions that a coincident microphone requires crosstalk in reproduction in order to provide low frequency phase shift differential localization. But the binaural arrangement of the two coincident microphone assemblies eliminates this concern. If we use two coincident s,o microphones (of course, if two microphone sets are used they cannot be coincident in Ambisonic lingo, but bear with me) spaced the average distance between the ears, we restore the natural low frequency phase cues. Remember that in Ambisonics, the problem was that if the wave was correct at the center of the head it might not be exact enough at the sides of the head. Well, in our Ambiophonic version of Ambisonics, the right and left ear signals are isolated by the panel or panels extending from the listener between the speakers, so now we can independently generate a wave for each ear. This means that for sound sources in the central 90° or so, the left half of a right side o microphone circle and the right half of a left o circle are not really required. Thus the decoding equations become even simpler.

If the mics were now mounted in a dummy head with pinnae, then the signal would be modified by pinnae squared (once by the dummy’s pinnae and again by the listener’s own), which is bad. But if the mics were just mounted on a head-wide boom, then on playback, since the speakers are close together, there would be no head related response even for signals from the sides. Therefore the best arrangement is to use a dummy head without pinnae and place either an o or an o-s pair at the entrance to each dummy head ear canal. The dummy head will of course not match the listener’s head exactly, but the effect is slight, since in this case the pinnae are not involved and this discrepancy only affects side sources. Someday, one will be able to get one’s head response measured and correct for the difference between one’s own head and the standard dummy head. Looked at this way, Ambiophonics is just a subset of Ambisonics or a superset of Binaural.

I haven’t forgotten about the s signals. Since the half-exposed s microphones point to the extreme front sides, they are not always needed if there is no music there. Depending on the exact aperture of the o microphones, the shadowing effect of the dummy head, the loudspeaker angle and the lower frequency cues present, stage widths of 120° are obtainable without an s signal. However, if four channels are available and two more front side speakers are used, the stage width can be expanded. In this case interaural crosstalk is not as much of a spoiling factor because, by definition, extreme side sound sources engender neither pinna angle distortion nor loudspeaker crosstalk. There is, however, a likelihood of localization to a single side speaker, and to avoid this a full-fledged Ambisonic decoder could be used with a pair of side speakers. Another possibility is to modify the s signal with a pinna equalizer and add it to the o signal so that when it comes from the front speakers it will sound as natural as possible even in the absence of any interaural low frequency cues. Again, for best results, this would require having your pinna response measured and applying a correction to the hopefully published recording company pinna equalizer. There is at least one company that maintains that 25 pinna response curves describe over 90% of the world’s population, and it has put such pinna responses in a box so you can select the one that works best with your own ears. Thus the personal pinna era may be here sooner rather than later.

Nothing New Under the Sun

After completing the above text on recording, I began to see if the recordings that were easy to play back Ambiophonically had anything consistent or unusual about them. Not being a recording engineer or a microphone aficionado, it took me a while to notice that many of the easy to match CDs were made with something called a Schoeps KFM-6. A picture of this microphone in a PGM Recordings promotional flyer showed a head-sized but spherical ball with two omnidirectional microphones, one recessed on each side of the ball where the ear canals would be if we had an exactly round head. The PGM flyer also included a reference to a paper by Günther Theile describing the microphone, entitled On the Naturalness of Two-Channel Stereo Sound, J. Audio Eng. Soc., Vol. 39, No. 10, Oct. 1991.

Although Theile would probably object to my characterization of his microphone, his design is essentially a simplified dummy head without external ears. He states, “It is found that simulation of depth and space are lacking when coincident microphone and panpot techniques are applied. To obtain optimum simulation of spatial perspective it is important for two loudspeaker signals to have interaural correlation that is as natural as possible…Music recordings confirm that the sphere microphone combines favorable imaging characteristics with regard to spatial perspective accuracy of localization and sound color…” Later he states, “The coincident microphone signal, which does not provide any head-specific interaural signal differences, fails not only in generating a head-referred presentation of the authentic spatial impression and depth, but also in generating a loudspeaker-referred simulation of the spatial impression and depth…it is important that, as far as possible, the two loudspeaker signals contain natural interaural attributes rather than the resultant listener’s ear signals in the playback room.”

One minor problem with the Theile approach remains. For signals coming from the side, the sphere acts as a sort of filter for the shorter wavelengths, just as the head does. When this side sound comes from side stereo speakers, the listener’s head acts as a filter again, resulting in HRTF squared. The solution is to move the speakers to fifteen degrees in front of the listener, use the barrier, and listen to the Theile sphere without the second head response function. I have done this and it works. Eventually, a listener would substitute his own pinnaless HRTF for that of the sphere and the accuracy would be further enhanced.
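The phrase HRTF squared can be made concrete with a line of algebra. As a sketch only, treat the sphere and the listener’s head as two linear filters in cascade, with frequency f and source angle θ. The symbols below are introduced here for illustration and do not come from Theile’s paper, and the approximation assumes the sphere response roughly resembles the pinnaless head response.

```latex
\[
H_{\text{total}}(f,\theta)
  = H_{\text{sphere}}(f,\theta)\, H_{\text{head}}(f,\theta)
  \approx H(f,\theta)^{2},
\qquad
20\log_{10}\lvert H_{\text{total}}\rvert
  \approx 2 \cdot 20\log_{10}\lvert H \rvert .
\]
```

In decibel terms, whatever head shadow the sphere imposed at recording time is applied roughly twice when the sound is replayed from widely spaced speakers, which is why removing the second head response during playback is attractive.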

Theile also “generates artificial reflections and reverberation from spot-microphone signals.” He uses the word artificial in the sense that the spot microphone signals will be coming from the front stereo loudspeakers instead of from the rear, the sides, or overhead. While Theile’s results rest as much on empirical subjective opinion as they do on psychoacoustic precepts, they certainly are consistent with the premises of Ambiophonics both in recording and reproduction.

Realistic Reproduction of Depth

It is axiomatic that a realistic music reproduction system should render depth as accurately as possible. Fortunately, front stage distance cues are easier to record and/or simulate realistically than most other parameters of the concert-hall sound field. Assuming that the recording microphones are placed at a reasonable distance from the front of the stage, then the high frequency roll-off due to distance and the general attenuation of sound with distance remain viable distance cues in the recording. Depth of discrete stage sound sources is, however, more strongly evidenced in concert-halls by the amplitude and delay of the early reflections and the ear finds it easier to sense this depth if there is a diversity of such reflections. In Ambiophonics, simulated early reflections from 55° make the stage as a whole seem more interesting, but it is only the recorded early reflections coming from the front speakers that provide the reflections that allow depth differentiation between individual instruments. This is why anechoic recordings sound so flat when played back stereophonically or even Ambiophonically, despite the presence of a simulated ambient field. In ordinary stereo, depth perception will suffer if early side and rear hall reflections wrap around to the front speakers or in the anechoic case, are completely missing. Since it is easy to make Ambiophonic recordings that include just proscenium ambience, why not do so and save on synthesis processing power and preserve, undistorted, the depth perception cues?
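The way early reflections encode depth can be illustrated with simple geometry. The sketch below, in Python, uses the image-source idea for a single side wall: the microphone sits on the hall centre line, the source is straight ahead at some depth, and the wall runs parallel to that line. The speed of sound, the free-field 1/distance level law, a perfectly reflecting wall, and the function name `side_wall_reflection` are all assumptions made for illustration; a real hall has many reflecting surfaces and frequency dependent absorption.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: air at about 20 degrees C


def side_wall_reflection(source_depth_m, wall_offset_m):
    """First-order side-wall reflection relative to the direct sound.

    source_depth_m -- distance from microphone to source, straight ahead
    wall_offset_m  -- sideways distance from the centre line to the wall
    Returns (delay_ms, level_db): the arrival delay of the reflection after
    the direct sound and its level relative to the direct sound, assuming
    1/distance spreading and a perfectly reflecting wall.
    """
    direct = source_depth_m
    # Mirror the source in the wall; the reflected path is the straight
    # line from that image source to the microphone.
    reflected = math.hypot(2.0 * wall_offset_m, source_depth_m)
    delay_ms = (reflected - direct) / SPEED_OF_SOUND_M_S * 1000.0
    level_db = 20.0 * math.log10(direct / reflected)
    return delay_ms, level_db


if __name__ == "__main__":
    for depth in (5.0, 10.0, 15.0):
        delay_ms, level_db = side_wall_reflection(depth, wall_offset_m=10.0)
        print(f"source at {depth:4.1f} m: reflection arrives "
              f"{delay_ms:5.1f} ms late at {level_db:+.1f} dB")
```

Under these assumptions a nearer source yields a later and relatively weaker reflection than a more distant one, so each instrument carries its own delay and level signature. This is the kind of recorded cue that a synthesized ambient field alone cannot supply.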

There remains the issue of perspective, however. When making a live performance recording of an opera or a symphony orchestra, the recording microphones are likely to be far enough away from the sound sources to produce an image at home that is not so close as to be claustrophobic. There are many recordings, however, that produce a sense of being at or just behind the conductor’s podium. This effect does not necessarily impact realism, but you must like to sit in the front row to be comfortable with this perspective. Turning down the volume and adding ambience can compensate for this, but at a loss in realism. This problem becomes more serious in the case of solo piano recordings or small jazz combos. For example, if a microphone is placed three feet from an eight-foot piano, then that piano is going to be an overwhelming close-up presence in the listening room, and a “They Are Here” instead of a “You Are There” effect is unavoidable. This can be very realistic, especially with close together speakers and the Ambiophonic barrier, but adding synthesized hall ambience doesn’t help much since the direct sound is so overwhelming. The major problem with this type of recording is that you have to like having these people so close in a small home listening room. You may notice that demonstrators of high resolution playback systems in showrooms or at shows overwhelmingly use close mic’ed recordings of small ensembles, solo guitar, single vocalists and the like to demonstrate the lifelike qualities of their products, and that these demonstrations are mostly of the “They Are Here” variety.

To Probe Further or Try It Yourself

Details on setting up an Ambiophonic playback system and other related topics are available at the Ambiophonics Institute web site. The book, Ambiophonics, by Keith Yates and Ralph Glasgal, is available from Borders Books & Music and Barnes & Noble.
