Ever
since 1881
when Clément
Ader ran
signals from
ten (likely
randomly)
spaced pairs
of telephone
carbon
microphones
clustered on
the stage of
the Paris
Opera via
phone lines to
single
telephone
receivers in
the Palace of
Industry that
were listened
to in pairs
seemingly
spaced more by
accident than
design,
practitioners
of the
recording arts
have been
striving to
reproduce a
musical event
taking place
at one
location and
time at
another
location and
time with as
little loss in
realism as
possible.
While
judgments as
to what sounds
real and what
doesn't may
vary from
individual to
individual,
and there are
many who
religiously
hold that
realism is not
the proper
concern of
audiophiles,
such views of
our hearing
life should
not be allowed
to slow
technical
advances in
the art of
realistic
auralization
that listeners
may embrace or
disdain as
they please.
In this
article we
will review
some past and
recent
developments
in this field
and indicate
areas where
more could
readily be
accomplished
by recording
engineers,
manufacturers
and
audiophiles
using existing
tools.
What
Is Realism In
Sound
Reproduction?
In
this article,
realism in
staged music
sound
reproduction
will usually
be understood
to mean the
generation of
a sound field
realistic
enough to
satisfy any
normal
ear-brain
system that it
is in the same
space as the
performers,
that this is a
space that
could
physically
exist, and
that the sound
sources in
this space are
as full bodied
and as easy to
locate as in
real life.
Realism does
not
necessarily
equate to
accuracy or
perfection.
Achieving
realism does
not mean that
one must
slavishly
recreate the
exact space of
a particular
recording
site. For
instance, a
recording made
in Avery
Fisher Hall
but reproduced
as if it were
in Carnegie
Hall is still
realistic,
even if
inaccurate. It
is doubtful
that any home
reproduction
system will be
able to
outperform a
live concert
in a hall the
caliber of
Boston's
Symphony Hall,
but in many
cases the home
experience can
now exceed a
live event in
acoustic
quality. For
example, a
recording of
an opera made
in a smallish
studio can
easily be made
to sound
better at
home, using
the methods
described
below, than it
did to most
listeners at a
crowded
recording
session. One
can also argue
that a home
version of
Symphony Hall,
where one is
apparently
sitting tenth
row center, is
more involving
than the live
experience
heard from a
rear side seat
in the balcony
with
obstructed
visual and
sonic view.
In
a similar
vein, realism
does not mean
perfection. If
a full
symphony
orchestra is
recorded in
Carnegie Hall
but played
back as if it
were in
Carnegie
Recital Hall,
one may have
achieved
realism but
certainly not
perfection.
Likewise, as
long as
localization
is as
effortless and
as precise as
in real life,
the reproduced
locations of
discrete sound
sources
usually don't
have to be
exactly in the
same positions
as at the
recording site
to meet the
standards of
realism
discussed
here. (Virtual
Reality
applications,
by contrast,
often require
extreme
accuracy but
realism is not
a
consideration.)
An example of
this occurs if
a recording
site has a
stage width of
120° but is
played back on
a stage that
seems only
90° wide.
What this
really means
in the context
of realism is
that the
listener has
moved back in
the reproduced
auditorium
some fifteen
rows, but
either stage
perspective
can be
legitimately
real. Finally,
being able to
localize a
stage sound
source in a
stereo,
surround sound
or Ambisonics
system does
not guarantee
that such
localization
will sound
real. For
example, a
soloist
reproduced
entirely via
one
loudspeaker is
easy to
localize but
almost never
sounds real.
Reality
Is In The Ear
Of The
Behearer
While
it is always
risky to make
comparisons
between
hearing and
seeing, I will
live
dangerously
for the
moment. If
from birth,
one were only
allowed to
view the world
via a small
black and
white TV
screen, one
could still
localize the
position of
objects on the
video screen
and could
probably
function quite
well. But
those of us
with normal
sight would
know how drab,
or I would say
unrealistic,
such a
restricted
view of the
world actually
was. If we now
added color to
our subject's
video screen,
the still
grossly
handicapped
(by our
standards)
viewer would
marvel at the
previously
unimaginable
improvement.
If we now
provided
stereoscopic
video, our now
much less
handicapped
viewer would
wonder how he
had ever
functioned in
the past
without depth
perception or
how he could
have regarded
the earlier
flat
monoscopic
color images
as being
realistic.
Finally, the
day would come
when we
removed the
small video
screens and
for the first
time our
optical guinea
pig was able
to enjoy
peripheral
vision and the
full
resolution,
contrast and
brightness
that the human
eye is capable
of and fully
appreciate the
miracle of
unrestricted
vision. The
moral of all
this is that
only when all
the visual
sense
parameters are
provided for
can one enjoy
true visual
reality. At
the present
time there is
no visual
recording or
display system
that any human
being could
mistake for
the real
thing, but the
IMAX system is
a tantalizing
foretaste of
what might
soon be
possible.
Since
most of us are
quite familiar
with what live
music in an
auditorium
sounds like,
we can sense
unreality in
reproduction
quite readily.
But in the
context of
audio
reproduction,
the
progression
toward realism
is similar to
the visual
progression
above. To make
reproduced
music sound
fully
realistic, the
ears, like the
eyes, must be
stimulated in
all the ways
that the
ear-brain
system
expects. Like
the visual
example, when
we go from
mono to stereo
to matrix
surround to
Ambisonics to
multi-channel
discrete,
etc. (listed in
order of
increasing
accuracy,
assuming that
a new
multi-channel
method will
actually
emerge that
can outperform
Ambisonics or,
as discussed
below,
Ambiophonics)
we marvel at
each
improvement,
but since we
already know
what real
concert halls
sound like, we
soon realize
that something
is missing.
What is
usually
missing is
completeness
and sonic
consistency.
One can only
achieve
realism if all
the ear's
expectations
are
simultaneously
satisfied. If
we assume that
we know
exactly how
all the
mechanisms of
the ear work,
then we could
conceivably
come up with a
sound
recording and
reproduction
system that
would be quite
realistic. But
if we take the
position that
we don't know
all the ear's
characteristics
or that we
don't know how
much they vary
from one
individual to
another or
that we don't
know the
relative
importance of
the hearing
mechanisms we
do know about,
then the only
thing we can
do, until a
greater
understanding
dawns, is what
Manfred
Schroeder
suggested over
a quarter of a
century ago,
and deliver to
the remote
ears a
realistic
replica of
what those
same ears
would have
heard when and
where the
sound was
originally
generated.
Four
Methods Used
To Generate
Reality At A
Distance
Audio
engineers have
grappled with
the problem of
recreating
sound fields
since the time
of Alexander
Graham Bell.
The classic
Bell Labs
theory
suggests that
a curtain, in
front of a
stage, with an
infinite
number of
ordinary
microphones
driving a like
curtain of
remote
loudspeakers
can produce
both an
accurate and a
realistic
replica of a
staged musical
event and
listeners
could sit
anywhere
behind this
curtain, move
their heads
and still hear
a realistic
sound field.
Unfortunately,
this method,
even if it
were
economically
feasible,
fails on the
first two
counts with
any finite
number of
speakers. Such
a curtain can
act like a
lens and
change the
direction or
focus of the
sound waves
that impinge
on it. Like
lightwaves,
sound waves
have a
directional
component that
is easily lost
in this
arrangement
either at the
microphone,
the speaker or
both places.
Thus each
radiating
loudspeaker in
practice
represents a
new discrete
sound source
of
uncontrolled
directionality,
communicating
directly with
both ears and
therefore
generating
comb filter
interference
patterns and
pinna
directional
distortion not
present on the
live side of
the curtain.
Finally,
this curtain
of
loudspeakers
does not
radiate into a
concert-hall
size listening
room and so
one would
have, say, an
opera house
stage attached
to a listening
room not even
large enough
to hold the
elephants in
Act 2 of Aida.
This lack of
opera-house
ambience
wouldn't by
itself make
this
reproduction
system sound
unreal, even
if the rest of
the field were
somehow made
accurate, but
it certainly
wouldn't sound
perfect. The
use of speaker
arrays (walls
of hundreds of
speakers)
surrounding a
relatively
large
listening area
has been
shown to be
able to
synthesize any
sound field in
a room with
remarkable
accuracy. But
while this
technique may
be useful in
sound
amplification
systems in
halls,
theaters or
labs,
application to
the playback
of even
multi-channel
recordings in
the home seems
doubtful
except for the
use of speaker
arrays at the
sides and rear
or even
overhead to
deliver truly
diffuse,
reconstituted
reverberant
ambience to
the home
listener.
In
general,
multi-channel
recording
methods or
matrix
surround
systems (Hafler,
SQ, QS, UHJ,
Dolby,
5.1, etc.) seem
like exciting
improvements
when first
heard by long
deprived
stereo music
auditors, but
in the end
don't sound
real.
The
Binaural
Approach
A
second, more
practical and
often exciting
approach is
the binaural
one. The idea
is that, since
we only have
two ears, if
we record
exactly what a
listener would
hear at the
entrance to
each ear canal
at the
recording site
and deliver
these two
signals,
intact, to the
remote
listener's ear
canals then
both accuracy
and realism
should be
perfectly
captured. This
concept almost
works and
could
conceivably be
perfected, in
the very near
future, with
the help of
advanced
computer
programs,
particularly
for virtual
reality
applications
involving
headsets or
near field
speakers. The
problem is
that if a
dummy head,
complete with
modeled ear
pinnae and ear
canal embedded
microphones,
is used to
make the
recording,
then the
listener must
listen with
in-the-ear-canal
earphones
because
otherwise the
listener's own
pinnae would
also process
the sound and
spoil the
illusion.
The
real
conundrum,
however, is
that the dummy
head does not
match closely
enough any
particular
human
listener's head
shape or
external ear
to avoid the
internalization
of the sound
stage whereby
one seems to
have a full
symphony
orchestra (and
all of
Carnegie Hall)
from ear to
ear and from
nose to nape.
Internalization
is the
inevitable and
only logical
conclusion a
brain can come
to when
confronted
with a sound
field not at
all processed
by the head or
pinnae. For
how else could
a sound have
avoided these
structures
unless it
originated
inside the
skull? If one
uses a dummy
head without
pinnae, then,
to avoid
internalization,
one needs
earphones that
stand off from
the head, say,
to the front.
But now the
direction of
ambient sound
is incorrect
and side
localization
is not fully
accurate. IMAX
is an example
of this
off-the-ear
method, as
supplemented
with
loudspeakers.
There is also
a circumaural
earphone
design that
places a tiny
speaker just
over the notch
in the lower
front part of
the ear so
that many of
the pinna
resonances are
still normally
excited for
frontal
sounds. A
similar ear
speaker over
the upper rear
part of the
ear can
provide a
similar pinna-friendly
input for rear
originating
sounds.
Unfortunately,
head-shape
differences
between the
dummy head and
the listener's
head remain,
and the dummy
head should
not have
modeled pinnae
if these
earphones are
to be used.
The
fact that
binaural sound
via earphones
runs into so
many
difficulties
is a powerful
indication
that
individual
head shapes
and outer ear
convolutions
are critically
important to
our ability to
sense sonic
reality.
Wavefield
Synthesis
A
third
theoretical
method of
generating
both an
accurate and a
realistic
soundfield is
to actually
measure the
intensity and
the direction
of motion of
the
rarefactions
and
compressions
of all the
impinging
soundwaves at
the single
best listening
position
during a
concert and
then recreate
this exact
sound wave
pattern at the
home listening
position upon
playback. This
method is the
one expounded
by the late
Michael Gerzon
starting in
the early 1970s
and embodied
in the
paradigm known
as Ambisonics.
In Ambisonics
(ignoring
height
components), a
coincident
microphone
assembly,
which is
equivalent to
three
microphones
occupying the
same point in
space,
captures the
complete
representation
of the
pressure and
directionality
of all the
sound rays at
a single point
at the
recording
site. In
reproduction,
speakers
surrounding
the listener
produce
soundwaves
that
collectively
converge at
one point (the
center of the
listener's
head) to form
the same
rarefactions
and
compressions,
including
their
directional
components,
that were
recorded.
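
As a rough illustration of the horizontal-only case just described, the sketch below encodes a single source into one pressure signal and two directional signals, then derives feeds for a regular ring of speakers. The unit-gain scaling, the function names and the very simple decoder are assumptions made for this example (practical B-format conventionally scales the pressure channel, and real decoders are more elaborate); it is not the decoder of any particular product.

```python
import math

def encode(sample, azimuth_deg):
    """Encode one source sample into pressure (w) and two directional
    components (x = front-back, y = left-right). Unit-gain scaling is
    an assumption made for this sketch."""
    az = math.radians(azimuth_deg)
    return sample, sample * math.cos(az), sample * math.sin(az)

def decode(w, x, y, speaker_azimuths_deg):
    """Basic decode for a regular ring of three or more speakers.
    Each feed is the pressure signal plus the directional signals
    projected onto that speaker's direction."""
    n = len(speaker_azimuths_deg)
    feeds = []
    for phi_deg in speaker_azimuths_deg:
        phi = math.radians(phi_deg)
        feeds.append((w + 2.0 * (x * math.cos(phi) + y * math.sin(phi))) / n)
    return feeds

def velocity_direction_deg(feeds, speaker_azimuths_deg):
    """Direction of the summed velocity vector at the center point,
    treating each speaker as a plane-wave source (head absent)."""
    vx = sum(g * math.cos(math.radians(p)) for g, p in zip(feeds, speaker_azimuths_deg))
    vy = sum(g * math.sin(math.radians(p)) for g, p in zip(feeds, speaker_azimuths_deg))
    return math.degrees(math.atan2(vy, vx))

# Hypothetical layout: four speakers on a square around the listener.
SPEAKERS_DEG = [45.0, 135.0, 225.0, 315.0]

w, x, y = encode(1.0, 30.0)              # unit-amplitude source at 30 degrees
feeds = decode(w, x, y, SPEAKERS_DEG)

print(round(sum(feeds), 6))                                   # summed pressure
print(round(velocity_direction_deg(feeds, SPEAKERS_DEG), 6))  # recovered direction
```

With this square layout the feeds sum to the original pressure and the reconstructed velocity direction at the center is again 30°, which is the sense in which the field is correct at that one point. Nothing in the sketch accounts for the head or the pinnae.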
In
theory, if the
reconstructed
soundwave is
correct in all
respects at
the center of
the head (with
the listener's
head absent
for the
moment) then
it will also
be correct
three and one
half inches to
the right or
left of this
point at the
entrance to
the ear canals
with the head
in place. The
major
advantage of
this technique
is that it can
encompass
front stage
sounds, hall
ambience and
rear sounds
equally, and
that since it
is recreating
the original
sound field
(at least at
this one
point) it does
not rely on
the phantom
image
mechanism of
Blumlein
stereo. On the
other hand,
Ambisonic
theory is silent
on the subject
of how the
sounds coming
from the
various
loudspeakers
are modified
by the ear
pinna and the
head shape and
how a decoder
might
compensate for
these effects.
Thus
the Ambisonic
method is not
easy to keep
accurate at
frequencies
much over 2000
Hz and must
and does rely
on the
apparent
ability of the
brain to
ignore this
lack of
realistic high
frequency
pinna, head
and waveform
localization
input and
localize on
the basis of
the easier to
reconstitute
lower
frequency
waveforms
alone. This
would be fine
if
localization,
by itself,
equated to
realism or we
were only
concerned with
movie surround
sound
applications.
Other
problems with
basic
Ambisonics
include the
fact that it
requires at
least three
recorded
channels (if
we are
concerned
about quality)
and therefore
can do little
for the vast
library of
existing
recordings.
Back on the
technical
problem side,
one needs to
have enough
speakers
around the
listener to
provide
sufficient
diversity in
sound
direction
vectors to
fabricate the
waveform with
exactitude and
all these
speakers'
positions,
relative to
the listener,
must be
precisely
known to the
Ambisonic
decoder.
Likewise the
frequency,
delay and
directional
responses of
all the
speakers must
be known or
closely
controlled for
best results
and as in all
other
loudspeaker
systems the
effects of
listening room
reflections
must also be
taken into
account, or
better yet,
eliminated.
As
you might
imagine, it is
quite
difficult,
particularly
as the
frequency goes
up, to ensure
that the size
of the
reconstructed
field at the
listening
position is
large enough
to accommodate
the head, all
the normal
motions of the
head, the
everyday
errors in the
listener's
position, and
more than one
listener.
Those readers
who have tried
to use the
Lexicon
panorama mode,
the Carver
sonic hologram
or the Polk
SDA speaker
system, all
designed to
correct the
higher
frequency
parts of a
simple stereo
soundfield at
the listener's
ear by
acoustic
cancellation,
will
appreciate how
difficult this
sort of thing
is to do in
practice, even
when only two
speakers are
involved.
In
my opinion,
however, the
basic barrier
to reality,
via any single
point waveform
reconstruction
method, like
Ambisonics, is
its present
inability, as
in the
binaural case,
to accommodate
to the effects
of the outer
ear and the
head itself on
the shape of
the waveform
actually
reaching the
ear canal. For
instance, if a
wideband
soundwave from
a left front
speaker is
supposed to
combine with a
soundwave from
a rear right
speaker and a
rear center
speaker etc.
then for those
frequencies
over say 2500
Hz the left
ear pinna will
modify the
sound from
each such
speaker quite
differently
than expected
by the
equations of
the decoder,
with the
result that
the waveform
will be
altered in a
way that is
quite
individual and
essentially
impossible for
any practical
decoder to
control. The
result is good
low frequency
localization
but poor or
non-existent
pinna
localization.
Unfortunately,
as documented
below, mere
localization,
lacking
consistency,
as is
the case in
stereo,
surround sound
or Ambisonics,
is no
guarantor of
realism.
Indeed, if we
must sacrifice
a localization
mechanism, let
it be the
lowest
frequency one.
Finally,
one can make a
case that one
can have
glorious
realism, even
without any
detailed front
stage
localization,
as long as
ambient
localization
is
directionally
correct (as
anyone who has
sat in the
last row of
the family
circle in
Carnegie Hall
can attest
to).
Ambiophonics
The
fourth
approach, that
I am aware of,
I have called
Ambiophonics.
Ambiophonics,
which borrows
a little from
Binaural and
still less
from
Ambisonics,
assumes that
there are more
localization
mechanisms
than are
dreamed of in
the previous
philosophies
and strives to
satisfy all of
the
mechanisms, as
far as is
possible. It
also takes the
psychoacoustic
position that
absolute
binaural
positional
accuracy, as
opposed to
absolute
realism, is
not as vital
and
furthermore,
that this
reproduction
technology
need only be
concerned with
reproducing
staged
acoustical
musical
events, not
movies or
virtual
reality. The
advantage of
focusing on
just one
aspect of
sonic reality
is that this
reality is
achievable
today, is
reasonable in
cost, and is
applicable to
existing LPs
and CDs.
One
basic element
in Ambiophonic
theory is that
it is best not
to record rear
and side
concert-hall
ambience or
try to extract
it later from
a difference
signal or
recreate it
via waveform
reconstruction,
but to
synthesize the
ambient part
of the field
using real
stored concert
hall data to
generate
ambience
signals using
the new
generation of
digital signal
processors.
The variety
and accuracy
of such
synthesized
ambient fields
is limited
only by the
skill of
programmers
and data
gatherers, and
the speed and
size of the
computers
used. Thus, in
time, any
wanted degree
of concert
hall design
perfection
could be
achieved. A
library of the
world's great
halls may be
used to
fabricate the
ambient field
as has already
been done with
startling
success in the
JVC XP-A1010.
The number of
speakers
needed for
ambience
generation
does not need
to exceed six
or eight
(although
speaker walls
would be
optimum) and
is comparable
to Ambisonics
or surround
sound in this
regard, but
even more
speakers could
be used as
this synthesis
method is
completely
scalable and
the quality
and location
of these
speakers are
not critical.
Ambiophonics
is usually
less limited
as to the
number of
listeners who
can share the
experience at
the same time
compared to
most
implementations
of other
methods using
a similar
number of
speakers.
Fortunately,
two to five
people can be
accommodated
by
Ambiophonics
in several of
its practical
incarnations.
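
The stored-hall idea can be pictured as convolving the recorded signal with a hall impulse response, once per ambience speaker. The sketch below is a toy version under stated assumptions: the three reflections (delays in samples and gains) are invented for illustration, not measured hall data, and a practical processor would use responses many thousands of samples long with a fast FFT-based method rather than this direct loop.

```python
def sparse_impulse_response(reflections, length):
    """Build an impulse response from (delay_in_samples, gain) pairs."""
    ir = [0.0] * length
    for delay, gain in reflections:
        ir[delay] += gain
    return ir

def convolve(dry, impulse_response):
    """Direct-form convolution of a dry signal with an impulse response."""
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for i, sample in enumerate(dry):
        if sample == 0.0:
            continue  # nothing to add for silent input samples
        for j, h in enumerate(impulse_response):
            out[i + j] += sample * h
    return out

# Hypothetical reflections for one ambience speaker. There is no tap at
# delay 0 because the direct sound comes from the front speakers.
REFLECTIONS = [(20, 0.5), (35, -0.3), (50, 0.2)]
hall_ir = sparse_impulse_response(REFLECTIONS, 51)

dry_signal = [1.0, 0.0, 0.5]             # a few samples of the recording
ambience_feed = convolve(dry_signal, hall_ir)

# Show only the non-zero output samples as (index, value) pairs.
print([(n, round(v, 3)) for n, v in enumerate(ambience_feed) if v != 0.0])
```

Each ambience speaker would get its own stored response, so adding speakers only means adding more convolutions of the same two recorded channels.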
The
other basic
tenet of
Ambiophonics
is similar to
Ambisonics and
that is to
recreate at
the listening
position an
exact replica
of the
original
pressure
soundwave.
However,
Ambiophonics
does this by
transporting
the sound
source, stage,
and hall to
the listening
room rather
than a point
wavefront to
the ears. In
other words,
Ambiophonics
externalizes
the binaural
effect, using,
as in the
binaural case,
just two
recorded
channels but
with two front
stage
reproducing
loudspeakers
and eight or
so ambience
loudspeakers
in place of
earphones.
Ambiophonics
generates
stage image
widths up to
120° with an
accuracy and
realism that
far exceeds
that of any
other
two-channel
reproducing
scheme. While
it hardly
seems to be
necessary, the
use of four
channels and
four main
front
loudspeakers
can produce a
full 180°
stage image
(see below),
but I doubt
the expense
would be worth
it since I for
one have never
attended a
live
performance,
and had a
seat, where
the music came
from anything
approaching a
full half
circle.
For
reasons
outlined
below,
Ambiophonic
reproduction
does require
that the two
main front
speakers
subtend an
angle of only
about 10°
each side of
the listening
position so as
not to
generate the
kind of pinna
angle
distortion for
central sounds
that
phantom-image-stereo,
Ambisonic or
surround sound
speaker
placement
almost always
gives rise to.
Ambiophonics
also requires
that a small
lightweight,
sound
absorbing
panel be
placed on
edge, centered
in front of
the listening
position so as
to prevent the
left front
speaker signal
from reaching
the right ear
and vice
versa. While
there are
electronic
means to
accomplish
this end
(Carver,
Lexicon,
Cooper-Bauck-Harman
Intl.) and
extra speaker
means (Polk or
easily home
made), none of
these work as
well as a
small
inexpensive
panel. The
Ambiophonic
listener is
free to rotate
his head, rock
back and
forth, and
undulate from
side to side,
without image
shift, just as
in a concert
hall. Most
audio
enthusiasts
imagine that
the use of the
panel will be
objectionable
on aesthetic
grounds. I
certainly wish
I could think
of a less
problematical
way to
accomplish the
same end, but,
at least in
practice, one
gets used to
the panel very
quickly and
soon wonders
why anyone
listens
without one.
The new
lightweight
materials make
it easy to
store the
panel between
sessions or
provide extras
for multiple
listeners.
The
Stereo Dipole,
AES Preprint
4463
For
those
unalterably
opposed to
using a panel,
Ole Kirkeby
and Philip A.
Nelson of The
University of
Southampton
with Hareo
Hamada of
Tokyo Denki
University
have developed
an electronic
version of the
panel. They
have shown
that the ideal
speaker
spacing for a
crosstalk
cancellation
system, be it
mechanical or
electronic, is
about 10
degrees. They
refer to two
speakers
placed so
close together
as a
"stereo
dipole".
The electronic
filters
required to
cancel
crosstalk in
this
arrangement
are somewhat
easier to
design and are
more effective
since at the
narrower angle
there is
little
diffraction
around the
head for the
correction
signals and so
HRTF
correction is
not critical.
Pinna angle
distortion of
the correction
signals is
also not a
major factor
and so the
crosstalk
cancellation
can be allowed
to operate
over the full
upper
frequency
range without
restricting
the size of
the listening
area or
generating the
audible
phasiness
effects that
afflict
electronic
crosstalk
cancellation
schemes for
widely spaced
loudspeakers.
A simple, low
cost,
lightweight
panel will
still remain
the best
choice for
critical
listeners.
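
To give a feel for the geometry only, the sketch below estimates how much later a speaker's signal reaches the far ear than the near ear, and where the first comb-filter notch falls when a central sound is fed equally to both speakers and no crosstalk cancellation is applied. It assumes straight-line paths with no head diffraction, a 7 inch ear spacing (twice the three and one half inches mentioned earlier) and a speed of sound of 343 m/s; the function names are made up for this example.

```python
import math

SPEED_OF_SOUND_M_S = 343.0   # assumed, air at room temperature
EAR_SPACING_M = 0.1778       # 7 inches, straight through the head

def crosstalk_delay_s(speaker_angle_deg):
    """Extra travel time to the far ear for a distant speaker placed
    speaker_angle_deg to one side of straight ahead (no diffraction)."""
    extra_path_m = EAR_SPACING_M * math.sin(math.radians(speaker_angle_deg))
    return extra_path_m / SPEED_OF_SOUND_M_S

def first_comb_notch_hz(speaker_angle_deg):
    """First cancellation frequency when the direct and crosstalk copies
    of the same signal arrive at one ear with equal level."""
    return 1.0 / (2.0 * crosstalk_delay_s(speaker_angle_deg))

for angle in (30.0, 10.0):   # conventional stereo versus a closely spaced pair
    delay_us = crosstalk_delay_s(angle) * 1e6
    print(angle, round(delay_us, 1), round(first_comb_notch_hz(angle)))
```

Under these simplifications the crosstalk delay at 10° is roughly a third of that at 30°, and the first notch moves up from about 1.9 kHz to about 5.6 kHz. A real head shadows and diffracts the sound, so these figures are indicative at best.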
Since
Ambiophonics
is a binaural
based system,
it does not
provide the
Blumlein
loudspeaker
crosstalk
signal that
furnishes the
lowest
frequency
phase shift
localization
cues for
recordings
made with a
coincident
microphone.
But to
counterbalance
this,
Ambiophonics,
or any
crosstalk
elimination
idea, is more
compatible
than is
standard
stereo with
the
overwhelming
majority of
non-coincident
microphone
recording
arrangements
and the
improvement in
HF
localization
more than
compensates
for any loss
in coincident
mic LF
localization.
Furthermore,
depending on
its size and
absorbency,
the barrier
(and even its
electronic
cousins) loses
its
effectiveness
at low
frequencies
thus allowing
some crosstalk
and therefore
amplifying LF
phase cues for
coincident
microphone
recordings.
One can also
move a little
further back
from the edge
of the barrier
or use a
smaller panel
when listening
to coincident
mic
recordings.
As
in all
realistic
systems, room
treatment is
essential for
a good result
and I have
found that
reducing the
room
reverberation
time to less
than 0.2
seconds works
well in this
context
especially if
used in
combination
with very
directional,
diffraction-free,
point-source,
front channel
loudspeakers
as once
recommended by
Malcolm
Hawksford in
another
context.
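
For a sense of what a 0.2 second reverberation time asks of a room, the sketch below applies Sabine's formula, RT60 = 0.161 V / A in metric units. The room dimensions are hypothetical, and Sabine's formula is known to become less accurate in heavily damped rooms, so the result is an estimate rather than a design figure.

```python
SABINE_CONSTANT = 0.161  # seconds per meter, metric form of Sabine's formula

def sabine_rt60(volume_m3, absorption_m2):
    """Estimated reverberation time for a room of the given volume and
    total absorption (surface area times absorption coefficient)."""
    return SABINE_CONSTANT * volume_m3 / absorption_m2

def required_absorption(volume_m3, target_rt60_s):
    """Total absorption needed to reach the target reverberation time."""
    return SABINE_CONSTANT * volume_m3 / target_rt60_s

# Hypothetical listening room, dimensions in meters.
LENGTH_M, WIDTH_M, HEIGHT_M = 5.0, 4.0, 2.5

volume = LENGTH_M * WIDTH_M * HEIGHT_M
surface = 2.0 * (LENGTH_M * WIDTH_M + LENGTH_M * HEIGHT_M + WIDTH_M * HEIGHT_M)

needed = required_absorption(volume, 0.2)   # square meters of absorption
average_alpha = needed / surface            # mean absorption coefficient

print(round(needed, 2), round(average_alpha, 3))
```

For this example room the estimate works out to an average absorption coefficient of a little under 0.5 across every surface, which suggests substantial treatment rather than a few panels.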
Other
Contrasts
Between
Ambiophonics
and Ambisonics
The
really
fundamental
difference
between
Ambisonics and
Ambiophonics
is that
Ambisonics
attempts to
fabricate the
exact
compressions
and
rarefactions,
including
their
intensities
and directions
at each ear
canal, by
summing the
outputs of a
given array of
sound emitters
whose drive
signals must
be derived by
computation
from three
signals: the f
(front) and s
(side)
directional
velocity
microphone
signals and
the o
omnidirectional
pressure
signal. (Since
most readers
will not be
familiar with
the
mathematical
symbols used
in Ambisonics
I will use o
instead of w
for the
omnidirectional
signal, f for
the front-rear
x signal and s
for the
left-right
side y signal.
We ignore the
'h'eight (z)
axis signal
here.) In
theory, there
could be a
playback
computer that
was fast
enough to
process such a
three channel
mic input with
the accuracy
needed to
produce a
perfect
spherical wave
front of say
the fifth
degree and the
fourth order
up to 15 kHz.
Each user
would also
have to load
his personal
pinna response
curve into
this computer
to get the
correct
waveform at
the entrance
to the ear
canal. Each
speaker signal
would then be
convolved with
the
appropriate
direction-dependent
pinna
function. You
would also
have to place
six or more
speakers
accurately,
enter their
delay and
polar
responses into
the computer,
and do
something
about room
reflections.
The recording
medium would
still require
three discrete
channels and
so this
powerful
Ambisonic
computer would
not do much
for non-Ambisonic
recordings.
So
far, by
restricting
itself to
relatively low
frequency
waveform
synthesis,
Ambisonics has
been able to
function at
reasonable
cost, please
its adherents
and seem a
promising
candidate as a
standard for
360° surround
sound for
video or
virtual
reality
applications
that require
height. In
contrast,
Ambiophonics
does not try
to create the
exact sound
field at the
recording
site, only one
that could
exist, that
can be
reproduced
without
generating
localization
contradictions,
and one that
can be
accepted by
the brain as
real. The
stage image
heard or the
hall ambience
that the
Ambiophonic
computer
generates may
not be exactly
Carnegie Hall
or may not
always be as
the recording
engineer
remembers, but
this system is
doable now and
works well
with most
existing
recordings,
even mono
ones. This is
not to say
that the
design of a
stored
ambience
convolution
computer is a
trivial
project, or
that it will
ever be a very
low cost
device, but
using stored
descriptions
of existing
halls makes
the job a lot
easier than
starting a
synthesis
program from
scratch and
the concert
hall
auralization
tools, already
used by
architects
today, could
be applied at
once to a
consumer
product.
Also,
in contrast to
Ambisonics,
Ambiophonics
does not
require a
known precise
placement of
ambience
speakers and
their polar
radiation
characteristics
are not of
critical
importance.
Remember that
small changes
in any ambient
field are
equivalent to
small changes
in the hall
volume, shape,
or finishes,
or shifts in
one's seat, or
in the number
of people in
the audience.
Psychoacoustic
Fundamentals
Related to
Realism in
Reproduced
Sound
Our
problem is how
to achieve
realistic
sound with the
psychoacoustic
knowledge at
hand or
suspected. For
starters, the
fact that
separated
front
loudspeakers
can produce
centrally
located
phantom images
between
themselves is
a
psychoacoustic
fluke that has
no purpose or
counterpart in
nature and is
a poor
substitute for
natural
frontal
localization.
Any
reproduction
methods that
rely on
stimulating
phantom
images, and
this includes
not only
stereo but
most versions
of surround
sound, can
never achieve
realism even
if they
achieve
localization.
Realism also
cannot be
obtained
merely by
adding
surround
ambience to
phantom
localization.
Ambisonics,
Binaural, and
Ambiophonics
do not employ
the phantom
image
mechanism to
provide the
front stage
localization
and therefore,
in theory,
should all
sound more
realistic than
stereo and, in
fact, do.
Ambiophonic
microphone
arrangements
could make
this approach
to realism
even more
effective, but
I am happy to
report that
Ambiophonics
works quite
well with most
of the
microphone
setups used in
classical
music or
audiophile
caliber jazz
recordings.
Adding home
generated
ambience
provides the
peripheral
sound vision
to perfect the
experience.
Since
our method is
to just give
the ears
everything
they need to
get real, it
is not
essential to
prove that the
pinna (and I
usually mean
this word to
also include
the concha,
the head and
the torso) are
more important
than some
other part of
the hearing
mechanism, but
the plain fact
is that they
are. To me it
seems
inconceivable
that anyone
could assume
that the pinna
are vestigial
or less
sensitive in
their own
frequency
domain than
the other ear
structures are
in theirs. As
a
hunter-gatherer
animal, it
would be of
the utmost
importance to
sense the
direction of a
breaking twig,
a snake's
hiss, an
elephant's
trumpet, a
bird's call,
the rustle of
game etc. and
probably of
less
importance to
sense the
lower
frequency
direction of
thunder, the
sigh of the
wind, or the
direction of
drums. The
size of the
human head
clearly shows
the bias of
nature in
having humans
extra
sensitive to
sounds over
700 Hz.
Look
at your ears.
The extreme
non-linear
complexity of
the outer ear
structures,
and their
small
dimensions
defy
mathematical
definition and
clearly
implies that
their exact
function is
too complex
and too
individual to
understand,
much less
fool, except
in half-baked
ways. The
convolutions
and cavities
of the ear are
so many and so
varied as
to make sure
that their
high frequency
response is as
jagged as
possible and
as distinctive
a function of
the direction
of sound
incidence as
possible. The
idea is that
no matter what
high
frequencies a
sound consists
of or from
what direction
a transient
sound comes
from, the
pinnae and
head together
or even a
single pinna
alone will
produce a
distinctive
pattern that
the brain can
learn to
recognize in
order to say
this sound
comes from
over there.
The
outer ear is
essentially a
mechanical
converter that
maps discrete
received sound
directions to
preassigned
frequency
response
patterns.
There is also
no purpose in
having the
ability to
hear
frequencies
over 10 kHz,
say, if they
cannot aid in
localization.
The dimensions
of the pinna
structures and
the
measurements
by Møller
strongly
suggest, if
not yet prove,
that the pinna
do function
for this
purpose even
in the highest
octave.
Møller's
curves of the
pinna and head
functions with
frequency and
direction are
so complex
that the
patterns are
largely
unresolvable
and very
difficult to
measure using
live subjects.
Again, it
doesn't matter
whether we
know exactly
how anyone's
ears work as
long as we
don't
compromise on
bandwidth,
frequency
response,
loudness,
distortion,
and especially
source
directionality,
at all
frequencies,
during
reproduction.
The
Evidence For
Pinna
Localization
Priority
The
above doesn't
mean that we
have to ignore
all the
research that
has preceded
us. The
literature
overwhelmingly
supports the
view that
localization
for broadband
sounds at
frequencies
over
approximately
1.5 kHz is
based on
single pinna,
dual pinnae
and the HRTF
(Head Related
Transfer
Function) and
is stronger
(more accurate
is a better
word) than the
localization
ability of the
ear at
frequencies
below say 600
Hz. (In this
and my other
papers on this
subject, I try
to use the
term HRTF to
refer only to
head and torso
effects that
modify sounds
before they
reach the
outer ears.) I
believe the
references
cited
below even
support the
notion that
localization
accuracy is
directly
proportional
to the
frequency of
complex
music-like
sounds which
goes a long
way toward
explaining why
transient
localization
is so strong.
It also
explains the
Franssen
Effect where
sound is
localized to
the source
sounding the
transient part
of a complex
signal that
has been
broken up into
two parts, one
the transient
and the other
the continuing
lower
frequency
sinusoid. See
Blauert,
Spatial
Hearing.
William
B. Snow in
Basic
Principles of
Stereophonic
Sound, 1953,
as reprinted
in
Stereophonic
Techniques,
states
"for
impulsive
sounds such as
speech or
clicks,
differences as
small as 1°
or 2° can be
perceived."
He goes on
"The
intensity
differences
(at the ears)
due to
diffraction
are functions
of frequency
and cause a
complex sound
to have a
different
frequency-intensity
composition or
quality at
each ear. It
is undoubtedly
this effect
which removes
ambiguities in
direction
because the
diffraction
effects are so
complicated
that a given
quality
difference can
correspond
only to one
direction."
Unfortunately,
Snow never
used the word
pinna, but he
does say
"however,
in the higher
frequency
region,
intensity
differences
produced by
the
diffraction or
sound shadow
effects of the
head and
external ears
become great
enough to give
angular
localization."
But,
you say, Snow
does not say
one mechanism
is stronger
than another,
although his
use of the
word clicks
strongly
implies this.
Fair enough.
An earlier bit
of research in
England,
reported by
James Moir in
Audio Magazine
(Oct. 1952) and
reprinted in
Stereophonic
Techniques, is
even more
explicit on
this point.
(See the
complete
bibliography
now on my web
site http://www.ambiophonics.org).
In his Table
Two he reports
on the
accuracy of
location as a
function of
the frequency
band of
filtered male
speech used as
a test signal.
For a
frequency band
of 50 to 500
Hz the average
localization
error was
3.8°; for 500
to 3000 Hz the
average error
was 0.9°; and
for 3000 to
7000 Hz (a
rather
restricted
bandwidth) the
average
localization
error was an
astonishingly
low 0.5°.
Furthermore,
although Moir
did not
comment on
this phenomenon,
his last table
entry for 50
to 7000 Hz
wideband
speech shows a
slightly
greater error
of 0.7°. One
could infer
from this
result that
in the
presence of
sufficient
high frequency
localization
cues, the
lower
frequencies
just get in
the way.
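
As a quick check on the arithmetic, the figures quoted above from Moir's Table Two can be tabulated in a few lines. This is only a sketch: the four numbers are the ones cited in the text, while the variable and function names are my own and purely illustrative.

```python
# Average localization error in degrees for each filtered-speech band,
# as quoted in the text from Moir's Table Two.
moir_error_deg = {
    (50, 500): 3.8,
    (500, 3000): 0.9,
    (3000, 7000): 0.5,
    (50, 7000): 0.7,  # wideband entry
}


def most_accurate_band(table):
    """Return the (low_hz, high_hz) band with the smallest average error."""
    return min(table, key=table.get)


best = most_accurate_band(moir_error_deg)
# How many times larger the low-band error is than the best band's error.
low_band_ratio = moir_error_deg[(50, 500)] / moir_error_deg[best]
print("most accurate band:", best)
print("low-band error is", round(low_band_ratio, 1), "times larger")
```

The comparison simply restates the point of the table: the highest band gives the smallest error, and the wideband entry is slightly worse than the high band alone.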
Don
Keele Jr., in
AES preprint
2420,
Nov. 1986, says
"We used
wide-band pink
noise for the
input signal
in all carrier
tests. An
interesting
phenomena that
we observed,
was the
breakup of the
sound image.
Changes in
amplitude and
delay are
effective in
shifting the
image only at
certain
frequencies:
Up to 700Hz,
for delay and
greater than
2000Hz, for
amplitude with
the region
between 700Hz
and 2000Hz
effective for
both in
combination.
At times we
would perceive
the low
frequencies
staying at the
origin and the
high
frequencies
shifting or
vice-versa.
The soundfield
(with barrier)
extends much
beyond the
typical stereo
arrangement of
30° to the
left and
right, however
the goal of a
180°
soundfield was
not met. The
amplitude
panned data
show that the
image shifted
in direct
proportion to
the amplitude
differential,
out to roughly
50° or 60°
off axis. The
delay panned
data is
similar in
that an image
shift limit is
found to occur
at roughly the
same angles.
This image
shift limit
noted in both
amplitude and
delay panned
data could be
due to two
possible
reasons:
Imperfect
blocking of
the crosstalk
signal by the
barrier and
the effect of
the ear's
pinna on the
frequency
response of
the received
acoustic
signal. ... In
the second
case, the
barrier-speaker
setup
generates
acoustic
signals that
always reach
the listener
coming from
directly
ahead. The
ears are not
receiving the
correct
frequency
response cues
due to pinna
effects, etc.
that signals
coming from
large off-axis
angles would
have. This
means that
additional
processing to
include these
effects may be
necessary to
swing the
signals
further around
to the
side."
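
The frequency regions Keele reports can be written down as a tiny lookup. This is a sketch of the quoted observation only: the 700 Hz and 2000 Hz limits come from the passage above, while the function name and the handling of the exact boundary values are my own assumptions.

```python
def effective_panning_cues(freq_hz):
    """Return which panning cues shift the image at freq_hz, following
    the regions quoted from Keele: delay up to 700 Hz, amplitude above
    2000 Hz, and both in between. Boundary handling is an assumption."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    if freq_hz <= 700:
        return {"delay"}
    if freq_hz > 2000:
        return {"amplitude"}
    return {"delay", "amplitude"}


for f in (200, 1000, 5000):
    print(f, "Hz:", sorted(effective_panning_cues(f)))
```

A signal panned with only one of the two cues can therefore split, with part of its spectrum moving and part staying put, which is the image breakup Keele describes.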
Don't
Tolerate Pinna
Privation!
In
my own
experience,
the pinnae
clearly
outvote the
lower
frequency
localization
senses. Try
the experiment
outlined at
the
Ambiophonic
web site to
test your own
pinna power.
The
internalization
of the
binaural sound
field
reproduced
with earphones
is another
good example
of the pinnae
riding roughshod
over the
interaural
delay cues. It
is true that
the binaural
image does
spread from
ear to ear,
but is this
accuracy
realistic?
Blauert, in
Spatial
Hearing, on
page 49,
confirms the
everyday
observation
that excellent
broadband
localization
is possible
even for
people totally
deaf in one
ear. One-eared
hearing cannot
possibly use
interaural
phase
mechanisms or
interaural
intensity
cues. Thus
there exists a
strong non-interaural
localization
mechanism that
cannot simply
be ignored.
The ability of
pinna
equalizing
boxes such as
the Klayman
NuReality
device to move
images freely,
despite the
presence of
unaltered low
frequency
cues, is also
indicative of
pinna power.
Why is this
relevant to
surround sound
for music or
to Ambisonics?
Well, if the
pinnae are as
important as
the literature
and I suggest,
then the
reconstruction
of the
Ambisonic
plane wave
must be
accurate to
beyond 10kHz
and this is
probably not
achievable.
Let me quote
from
Vanderkooy and
Bamford,
Ambisonic
Sound For Us,
AES Preprint
4138, Oct.
1995.
"the
benefits of
the Ambisonic
system rapidly
decline with
increasing
frequency."
Or in Oct.
1987,
Vanderkooy and
Lipshitz, AES
preprint 2554,
"We show
that it is
only in the
low frequency
regime, below
maybe 700Hz
that the
spatial region
within which
an Ambisonic
system will
reasonably-well
reconstruct
the traveling
wave which
would
correspond to
a real
acoustic
source, is
large enough
to encompass
the head of a
central
listener."
Similar
considerations
will likely
apply to most
of the
proposed
multi-channel
recording
standards now
being
considered, as
far as the
realistic
reproduction
of acoustical
musical events
is concerned.
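
To get a feel for why the usable reconstruction region shrinks with frequency, one can compare wavelength with head size. The sketch below is my own rough illustration, not the analysis in either preprint; the speed of sound (343 m/s) and the head width (0.17 m) are assumed round figures.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly room temperature
HEAD_WIDTH_M = 0.17         # assumed: a typical ear-to-ear distance


def wavelength_m(freq_hz):
    """Wavelength in metres of a sound wave in air at freq_hz."""
    return SPEED_OF_SOUND_M_S / freq_hz


for f in (700, 2000, 10000):
    lam = wavelength_m(f)
    print(f"{f:>6} Hz: wavelength {lam:.3f} m, "
          f"{lam / HEAD_WIDTH_M:.1f} head widths")
```

At 700 Hz the wavelength is still several head widths, while at 10 kHz it is a small fraction of one, so any field that must be correct to within part of a wavelength covers far less than the space between the ears.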
Ambiophonic
Recording for
Realism
One
can heighten
the accuracy,
if not gild
the lily of
realism, of an
Ambiophonic
reproduction
system by
taking
advantage in
the microphone
arrangement of
the knowledge
that in
playback, the
rear half and
side hall
ambience will
be
synthesized,
that there is
no crosstalk,
that listening
room
reflections
are minimized
and that the
front
loudspeakers
are relatively
close
together. For
political
reasons and as
an educational
exercise, we
can use
Ambisonic
nomenclature
to describe
such an
arrangement.
The sound
waves at a
given point in
space can be
completely
captured by
placing three
imaginary
microphones
simultaneously
at that
precise spot,
again ignoring
height. One of
these
microphones is
a pressure
microphone
whose
omnidirectional
output (o) is
simply
proportional
to the
instantaneous
sum of all
the
compressions
and
rarefactions
arriving at
that point. A
single
unobstructed o
microphone
signal
inherently
contains no
directional
information
(although it
could if it
were baffled).
The second
microphone is
a figure eight
(velocity)
microphone
pointing
straight ahead
and straight
behind. The
output of this
microphone (f)
is amplitude
sensitive to
the direction
from which the
soundwave
comes and
declines to
zero in cosine
fashion as the
sound moves
from directly
in front to
directly at
the side. In
other words a
velocity
microphone is
a
direction-to-amplitude
encoder. Such
a microphone
is similarly
sensitive to
sounds coming
from the rear,
but the
polarity of
such signals
is inverted.
The third
microphone (s)
is a second
identical
figure eight
microphone so
oriented that
it is most
sensitive to
sounds from
the left or
right and has
zero output
for signals
dead ahead or
dead behind.
We will
ignore, for
the moment,
the frequency
response and
other
aberrations of
real
microphones
and the
difficulty of
actually
making three
or even two
microphones
truly
coincident by
mechanical
means alone.
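
The three idealized microphones described above can be sketched numerically. This assumes perfect omni and figure-eight patterns with equal gain and a single plane wave arriving in the horizontal plane; the function names and the sign convention (positive azimuth toward the left) are my own illustrative choices, not Ambisonic standards.

```python
import math


def encode_ofs(pressure, azimuth_deg):
    """Outputs of ideal coincident o, f and s microphones for a plane wave.

    azimuth_deg: 0 is straight ahead, 90 is fully left (assumed
    convention), 180 is directly behind.
    """
    theta = math.radians(azimuth_deg)
    o = pressure                    # omni: no directional information
    f = pressure * math.cos(theta)  # front-back figure eight
    s = pressure * math.sin(theta)  # left-right figure eight
    return o, f, s


def azimuth_from_fs(f, s):
    """Recover the arrival angle in degrees from the two velocity
    outputs, valid here for a positive instantaneous pressure."""
    return math.degrees(math.atan2(s, f))


for az in (0, 45, 90, 180):
    o, f, s = encode_ofs(1.0, az)
    print(f"{az:>3} deg: o={o:.2f} f={f:.2f} s={s:.2f}")
```

The f output falls to zero at the side and changes sign for sound from behind, which is the direction-to-amplitude encoding and rear polarity inversion described in the text.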
It
would be very
simple, in
theory, to
combine
Binaural and
Ambisonic
methods to
produce
Ambinaural by
using two
headspaced
o,f,s
microphones,
six recording
channels, and
one Ambisonic
speaker
decoder for
each ear.
Ideally, one
would want the
playback
decoder for
each ear to
completely
determine the
sound at that
ear. In
reality, it is
almost
impossible to
prevent one
set of
speakers from
also
communicating
with the wrong
ear, but one
could use the
Ambiophonic
panel to
isolate the
two sets of
speakers.
Ambiosonics?
The decoders
in this six
channel system
would only
have to
generate a
frontal
wavefield at
one ear from
signals in one
90° quadrant,
perhaps using
three speakers
on each
frontal side,
which is much
easier to do
than for the
general case
of an entire
circle since
the rear half
of the ambient
field is
synthesized in
our case.
Since all of
the decoded
signals would
be coming from
more or less
the proper
directions the
pinna angle
distortion
problem would
not be
serious. This
is the
brute-force,
cost-no-object
method, and
indeed the
recent ARA
proposal for
DVD audio
suggests a
six-channel
recording
format, but
they don't
have anything
like
Binaurosonics
in mind yet.
Reductio
Ad Ambiosurdum
It
is very
difficult to
mount three
microphones so
that they are
really
coincident,
and although
there are
electronic
position
shifters that
correct for
this, we are
still left
with six
outputs when
we want
Ambiophonics
to work with
two. The trick
is to see how
the Ambisonic
microphone
arrangement
can be
simplified
without
sacrificing
anything vital
to realistic
music
reproduction.
The first
thing we
should
remember is
that we do not
need to pick up
any sound from
the rear half
circle since
that ambience
is computer
generated.
Thus if our
microphone is
baffled, we
can ignore the
rear half lobe
of the f
microphones
and the rear
half circle of
the o and s
signals. For
angles close
to the middle,
the f and o
signals would
be very much
the same so
let us assume
that we delete
the two f
signals
altogether. We
also know from
our earlier
discussions
that a
coincident
microphone
requires
crosstalk in
reproduction
in order to
provide low
frequency
phase shift
differential
localization.
But the
binaural
arrangement of
the two
coincident
microphone
assemblies
eliminates
this concern.
If we use two
coincident s,o
microphone sets
(of course, if
two microphone
sets are used
they cannot be
coincident in
Ambisonic
lingo, but
bear with me)
spaced the
average
distance
between the
ears, we
restore the
natural low
frequency
phase cues.
Remember that
in Ambisonics,
the problem
was that if
the wave was
correct at the
center of the
head it might
not be exact
enough at the
sides of the
head. Well, in
our
Ambiophonic
version of
Ambisonics,
the right and
left ear
signals are
isolated by
the panel or
panels
extending from
the listener
between the
speakers, so
now we can
independently
generate a
wave for each
ear. This
means that for
the central
90° or so
sound sources,
the left half
of a right
side o
microphone
circle and the
right half of
a left o
circle are not
really
required. Thus
the decoding
equations
become even
simpler.
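
The low frequency phase cue restored by spacing the two microphone sets a head width apart can be estimated with simple geometry. This is a sketch under stated assumptions: a distant source (plane wave), microphones on an open boom with no head between them, a 0.17 m spacing and a 343 m/s speed of sound. All names are illustrative.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed
MIC_SPACING_M = 0.17        # assumed average distance between the ears


def arrival_time_difference_s(azimuth_deg, spacing_m=MIC_SPACING_M):
    """Difference in arrival time at the two microphones for a plane
    wave from azimuth_deg (0 is straight ahead, 90 is fully to one side)."""
    return spacing_m * math.sin(math.radians(azimuth_deg)) / SPEED_OF_SOUND_M_S


def phase_difference_deg(freq_hz, azimuth_deg, spacing_m=MIC_SPACING_M):
    """Phase difference between the two microphone signals at freq_hz."""
    return 360.0 * freq_hz * arrival_time_difference_s(azimuth_deg, spacing_m)


for az in (0, 30, 90):
    dt_us = arrival_time_difference_s(az) * 1e6
    print(f"{az:>2} deg: {dt_us:6.1f} microseconds, "
          f"{phase_difference_deg(500, az):5.1f} deg of phase at 500 Hz")
```

Setting the spacing to zero, as with a truly coincident pair, makes both quantities vanish for every direction, which is why the coincident arrangement must rely on crosstalk to supply this cue.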
If
the mics were
now mounted in
a dummy head
with pinnae,
then the
signal would
be modified by
the pinnae
twice, once in
recording and
once in
listening
(pinnae
squared),
which is bad.
But if the
mics were just
mounted on a
head-wide
boom, then on
playback,
since the
speakers are
close
together,
there would be
no head
related
response even
for signals
from the
sides.
Therefore the
best
arrangement is
to use a dummy
head without
pinnae and
place either
an o or an o-s
pair at the
entrance to
the dummy head
ear canal. The
dummy head
will of course
not match the
listener's head
exactly, but
the effect is
slight, since
in this case
the pinnae are
not involved
and this
discrepancy
only affects
side sources.
Someday, one
will be able
to get his
head response
measured and
correct for
the difference
between his
head and the
standard dummy
head. Looked
at this way,
Ambiophonics
is just a
subset of
Ambisonics or
a superset of
Binaural.
I
haven't
forgotten
about the s
signals. Since
the half
exposed s
microphones
point to the
extreme front
sides, if
there is no
music there,
they are not
always needed.
Depending on
the exact
aperture of
the o
microphones,
the shadowing
effect of the
dummy head,
the
loudspeaker
angle and the
lower
frequency cues
present, stage
widths of
120° are
obtainable
without an s
signal.
However, if
four channels
are available
and two more
front side
speakers are
used, the
stage width
can be
expanded. In
this case
interaural
crosstalk is
not as much of
a spoiling
factor
because, by
definition,
extreme side
sound sources
engender
neither pinna
angle
distortion nor
loudspeaker
crosstalk but
there is a
likelihood of
localization
to a single
side speaker,
and to avoid
this a full
fledged
Ambisonic
decoder could
be used with a
pair of side
speakers.
Another
possibility is
to modify the
s signal with
a pinna
equalizer and
add it to the
o signal so
that when it
comes from the
front speakers
it will sound
as natural as
possible even
in the absence
of any
interaural low
frequency
cues. Again,
for best
results, this
would require
having your
pinna response
measured and
applying a
correction to
the hopefully
published
recording
company pinna
equalizer.
There is at
least one
company that
maintains that
25 pinna
response
curves
describe over
90% of the
world's
population and
they have put
such pinna
responses in a
box so you can
select the one
that works
best with your
own ears. Thus
the personal
pinna era may
be here sooner
rather than
later.
Nothing
New Under the
Sun
After
completing the
above text on
recording, I
began to check
whether the
recordings
that were easy
to play back
Ambiophonically
had anything
consistent or
unusual about
them. Not
being a
recording
engineer or a
microphone
aficionado, it
took me a while
to notice that
many of the
easy to match
CDs were made
with something
called a
Schoeps KFM-6.
A picture of
this
microphone in
a PGM
Recordings
promotional
flyer showed a
head-sized but
spherical ball
with two
omnidirectional
microphones
one recessed
on each side
of the ball
where ear
canals would
be if we had
an exactly
round head.
The PGM flyer
also included
a reference to
a paper by
Günther
Theile
describing the
microphone,
entitled On
the
Naturalness of
Two-Channel
Stereo Sound,
J. Audio Eng.
Soc., Vol. 39,
No. 10, Oct.
1991.
Although
Theile would
probably
object to my
characterization
of his
microphone,
his design is
essentially a
simplified
dummy head
without
external ears.
He states, "It
is found that
simulation of
depth and
space are
lacking when
coincident
microphone and
panpot
techniques are
applied. To
obtain optimum
simulation of
spatial
perspective it
is important
for two
loudspeaker
signals to
have
interaural
correlation
that is as
natural as
possible...
Music
recordings
confirm that
the sphere
microphone
combines
favorable
imaging
characteristics
with regard to
spatial
perspective,
accuracy of
localization
and sound
color..." Later
he states, "The
coincident
microphone
signal, which
does not
provide any
head-specific
interaural
signal
differences,
fails not only
in generating
a
head-referred
presentation
of the
authentic
spatial
impression and
depth, but
also in
generating a
loudspeaker-referred
simulation of
the spatial
impression and
depth... it
is important
that, as far
as possible,
the two
loudspeaker
signals
contain
natural
interaural
attributes
rather than
the resultant
listener's ear
signals in the
playback room."
One
minor problem
with the
Theile
approach
remains. For
signals coming
from the side,
the sphere
acts as a sort
of filter for
the shorter
wavelengths
just as the
head does.
When this side
sound comes
from side
stereo
speakers the
listener's head
acts as a
filter again
resulting in
HRTF squared.
The solution
is to move the
speakers to
fifteen
degrees in
front of the
listener, use
the barrier,
and listen to
the Theile
sphere without
the second
head response
function. I
have done this
and it works.
Eventually, a
listener would
substitute his
own pinnaless
HRTF for that
of the sphere
and the
accuracy would
be further
enhanced.
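
The "HRTF squared" problem is easy to put in numbers. When the same kind of filtering is applied once by the sphere in recording and again by the listener's head in playback, the linear gains multiply and the decibel values add. The sketch below uses a purely hypothetical 6 dB of shadowing at some high frequency; the figure and the function names are illustrative, not measured data.

```python
import math


def db_to_gain(level_db):
    """Convert a level in dB to a linear amplitude gain."""
    return 10.0 ** (level_db / 20.0)


def gain_to_db(gain):
    """Convert a linear amplitude gain to a level in dB."""
    return 20.0 * math.log10(gain)


def cascaded_response_db(recording_db, playback_db):
    """Two filters in series: linear gains multiply, so dB values add."""
    return gain_to_db(db_to_gain(recording_db) * db_to_gain(playback_db))


shadow_db = -6.0  # hypothetical shadowing at one frequency, one side source
print("shadow applied twice:", round(cascaded_response_db(shadow_db, shadow_db), 1), "dB")
print("shadow applied once: ", round(cascaded_response_db(shadow_db, 0.0), 1), "dB")
```

Moving the speakers to the front removes the second shadowing term for side sources, leaving only the one recorded by the sphere.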
Theile
also
"generates
artificial
reflections
and
reverberation
from
spot-microphone
signals."
He uses the
word
artificial in
the sense that
the spot
microphone
signals will
be coming from
the front
stereo
loudspeakers
instead of
from the rear,
the sides, or
overhead.
While Theile's
results rest
as much on
empirical
subjective
opinion as
they do on
psychoacoustic
precepts, they
certainly are
consistent
with the
premises of
Ambiophonics
both in
recording and
reproduction.
Realistic
Reproduction
of Depth
It
is axiomatic
that a
realistic
music
reproduction
system should
render depth
as accurately
as possible.
Fortunately,
front stage
distance cues
are easier to
record and/or
simulate
realistically
than most
other
parameters of
the
concert-hall
sound field.
Assuming that
the recording
microphones
are placed at
a reasonable
distance from
the front of
the stage,
then the high
frequency
roll-off due
to distance
and the
general
attenuation of
sound with
distance
remain viable
distance cues
in the
recording.
Depth of
discrete stage
sound sources
is, however,
more strongly
evidenced in
concert-halls
by the
amplitude and
delay of the
early
reflections
and the ear
finds it
easier to
sense this
depth if there
is a diversity
of such
reflections.
In
Ambiophonics,
simulated
early
reflections
from 55° make
the stage as a
whole seem
more
interesting,
but it is only
the recorded
early
reflections
coming from
the front
speakers that
provide the
reflections
that allow
depth
differentiation
between
individual
instruments.
This is why
anechoic
recordings
sound so flat
when played
back
stereophonically
or even
Ambiophonically,
despite the
presence of a
simulated
ambient field.
In ordinary
stereo, depth
perception
will suffer if
early side and
rear hall
reflections
wrap around to
the front
speakers or, in
the anechoic
case, are
completely
missing. Since
it is easy to
make
Ambiophonic
recordings
that include
just
proscenium
ambience, why
not do so and
save on
synthesis
processing
power and
preserve,
undistorted,
the depth
perception
cues?
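
Two of the distance cues mentioned above, overall attenuation and the delay of an early reflection, can be estimated with a few lines of arithmetic. This is a free-field sketch only: it assumes a point source obeying the inverse-square law and a 343 m/s speed of sound, it ignores the hall's reverberant field and the high frequency roll-off, and the function names and example distances are hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed


def level_change_db(near_m, far_m):
    """Free-field level change when a point source moves from near_m
    to far_m away from the microphone (inverse-square law)."""
    return -20.0 * math.log10(far_m / near_m)


def reflection_delay_ms(direct_path_m, reflected_path_m):
    """Delay of a reflection relative to the direct sound, in ms."""
    return (reflected_path_m - direct_path_m) / SPEED_OF_SOUND_M_S * 1000.0


# Hypothetical example: one player 5 m from the microphones, another 10 m.
print("level change:", round(level_change_db(5.0, 10.0), 1), "dB")
# Hypothetical reflection whose path is 3.43 m longer than the direct path.
print("reflection delay:", round(reflection_delay_ms(10.0, 13.43), 1), "ms")
```

Each instrument on stage has its own set of such level and delay values, and it is these recorded differences, not the synthesized ambience, that let the ear sort the players by depth.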
There
remains the
issue of
perspective,
however. When
making a live
performance
recording of
an opera or a
symphony
orchestra, the
recording
microphones
are likely to
be far enough
away from the
sound sources
to produce an
image at home
that is not so
close as to be
claustrophobic.
There are many
recordings,
however, that
produce a
sense of being
at or just
behind the
conductor's
podium. This
effect does
not
necessarily
impact realism
but you must
like to sit in
the front row
to be
comfortable
with this
perspective.
Turning down
the volume and
adding
ambience can
compensate for
this, but at a
loss in
realism. This
problem
becomes more
serious in the
case of solo
piano
recordings or
small jazz
combos. For
example, if a
microphone is
placed three
feet from an
eight foot
piano, then
that piano is
going to be an
overwhelming
close-up
presence in
the listening
room and a
"They-Are-Here"
instead of a
"You Are
There"
effect is
unavoidable.
This can be
very realistic
especially
with closely
spaced
speakers and
the
Ambiophonic
barrier, but
adding
synthesized
hall ambience
doesn't help
much since the
direct sound
is so
overwhelming.
The major
problem with
this type of
recording is
that you have
to like having
these people
so close in a
small home
listening
room. You may
notice that
demonstrators
of high
resolution
playback
systems in
showrooms or
at shows
overwhelmingly
use close-mic'ed
recordings of
small ensembles,
solo guitar,
single vocalists,
etc. to
demonstrate
the lifelike
qualities of
their products
and that these
demonstrations
are mostly of
the "They
Are Here"
variety.
To
Probe Further
or Try It
Yourself
Details
on setting up
an Ambiophonic
playback
system and
other related
topics are
available at
the
Ambiophonics
Institute web
site
www.ambiophonics.org.
The book,
Ambiophonics,
by Keith Yates
and Ralph
Glasgal is
available from
www.amazon.com,
Borders Books
& Music,
and Barnes
& Noble.

|