Masking Effects
Part of the mental filtering described earlier in this chapter, which occurs
unconsciously at every moment for all of us, is a process called masking. It
is of much interest to students of psychoacoustics: the study of the
interrelation between the ear, the mind, and vibratory audio signals. Two
separate masking effects come into play in MP3 encoding: simultaneous
(auditory) and temporal.
Simultaneous (auditory) masking
The simultaneous masking effect (sometimes referred to as "auditory
masking") may be best described by analogy. Think of a bird flying in
front of the sun. You see the bird flying in from the left, then it seems to
disappear, because the sun's light is so strong in contrast. As it moves past
the sun to the right, it becomes visible again. In more concrete audio terms,
recall how you can sometimes hear an acoustic guitarist's fingers sliding over
the ridged spirals of the guitar strings during quiet passages. Of course, you
seldom if ever hear this effect during a full-on rock anthem, because the wall
of sound surrounding the guitar all but completely drowns these subtle
effects.
The MP3 codec, of course, is unconcerned with
guitar strings; all it knows are relative frequencies and volume levels. So, to
put simultaneous masking into more concrete terms, let's say you have an audio
signal consisting of a perfect sine wave fluctuating at 1,000Hz. Now you
introduce a second perfect sine wave, this one fluctuating at a pitch just
slightly higher (let's make it 1,100Hz) but also much quieter: say, -10
dB.[6]
Most humans will not be able to detect the second pitch at all. However,
the reason the second pitch is inaudible is not just because it's quieter;
it's because its frequency is very close (similar) to that of the first. To
illustrate this fact, we'll slowly change the frequency (pitch) of the second
tone until it's fluctuating at, say, 4,000Hz. However, we'll leave its volume
exactly as it was, at -10 dB. As the second pitch becomes more dissimilar from
the first, it becomes more audible, until at a certain point, most humans will
hear two distinct tones, one louder than the other, as illustrated in Figure
2-2. At Point A, Tone 2 is barely audible next to Tone 1. At Point
B, Tone 2 is quite audible, even though its volume remains unchanged.
Figure 2-2: As two simultaneous tones become more
dissimilar, they become recognizable as separate entities
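The two-tone experiment above can be approximated in a few lines of Python. This is only a toy sketch, not the psychoacoustic model an MP3 encoder actually uses. The Bark-scale conversion is a commonly cited approximation that does not appear in this chapter, and the masking slope of 10 dB per Bark is an invented, purely illustrative value chosen so that the numbers in the text behave as described. All function names are hypothetical.

```python
import math

def db_to_amplitude(level_db):
    """Convert a relative level in dB to a linear amplitude ratio."""
    return 10 ** (level_db / 20)

def hz_to_bark(freq_hz):
    """Approximate critical-band (Bark) position of a frequency.

    Commonly cited approximation; an assumption, not taken from this chapter.
    """
    return 13 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500) ** 2)

# ASSUMPTION: illustrative value only. Real masking curves are asymmetric
# and depend on the level and type of the masking sound.
SLOPE_DB_PER_BARK = 10.0

def is_masked(masker_hz, masker_db, probe_hz, probe_db):
    """Return True if the quieter probe tone falls under the toy masking threshold."""
    distance = abs(hz_to_bark(probe_hz) - hz_to_bark(masker_hz))
    threshold_db = masker_db - SLOPE_DB_PER_BARK * distance
    return probe_db < threshold_db

def two_tone_samples(f1_hz, f2_hz, level2_db, sample_rate=44100, count=441):
    """Mix a full-level sine at f1_hz with a quieter sine at f2_hz."""
    a2 = db_to_amplitude(level2_db)
    return [
        math.sin(2 * math.pi * f1_hz * i / sample_rate)
        + a2 * math.sin(2 * math.pi * f2_hz * i / sample_rate)
        for i in range(count)
    ]

if __name__ == "__main__":
    # Point A: Tone 2 sits close to Tone 1 in frequency.
    print(is_masked(1000, 0, 1100, -10))
    # Point B: same volume, but Tone 2 has moved far away in frequency.
    print(is_masked(1000, 0, 4000, -10))
```

Under these assumed parameters, the 1,100Hz tone falls below the threshold and the 4,000Hz tone does not, which mirrors Points A and B in Figure 2-2.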
What's going on here is a psychoacoustic
phenomenon called "simultaneous masking," which demonstrates an
important aspect of the mind's role in hearing: Any time frequencies are close
to one another, we have difficulty perceiving them as unique, much as
mountains on the distant horizon may appear to be evenly textured and
similarly colored, even while the same mountains might be full of variation
and rich flora if one were hiking in them. In effect, we have the aural
equivalent of an optical illusion: a trick of our perceptual capacity that
contributes to our brain's ability to filter out the less relevant and give
focus to stronger elements.
Now consider for a moment the fact that an
audio signal consisting of two sine waves (even if one is quieter) contains
almost twice as much data as a signal containing a single wave. If you were to
try to compress an audio signal containing two sine waves, you would want the
ability to devote less disk storage space to the nearly inaudible signal, and
more to the dominant signal. And, of course, this is precisely what the
algorithms behind most audio compression formats do: they exploit certain
aspects of human psychoacoustics to allocate storage space
intelligently. Whereas a raw (waveform or PCM[7])
audio storage format will use just as much disk space to store a texturally
constant passage in a symphonic work as it will for a dynamically textured
one, an MP3 file will not. Thus, MP3 and similar audio compression formats are
called "perceptual codecs" because they are, in a sense,
mathematical descriptions of the limitations of human auditory perception. The
MP3 codec is based on perceptual principles but also encapsulates many other
factors, such as the number of bits per second allocated to storing the data
and the number of channels being stored, i.e., mono, stereo, or in the case of
other formats such as AAC or MP3 with MPEG-2 extensions, multi-channel audio.
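The storage difference is easy to see with some arithmetic. The sketch below assumes CD-style PCM parameters (44,100 samples per second, 16 bits per sample, two channels) and a constant bitrate of 128 kbps for the compressed file. None of these figures come from this section, and the function names are hypothetical. File headers and frame overhead are ignored.

```python
def pcm_bytes(seconds, sample_rate=44100, bits_per_sample=16, channels=2):
    """Size of raw PCM audio: depends only on duration and format, never on content."""
    return seconds * sample_rate * bits_per_sample * channels // 8

def cbr_bytes(seconds, bitrate_kbps=128):
    """Approximate size of a constant-bitrate stream (assumed 128 kbps by default)."""
    return seconds * bitrate_kbps * 1000 // 8

if __name__ == "__main__":
    one_minute = 60
    raw = pcm_bytes(one_minute)
    compressed = cbr_bytes(one_minute)
    print(raw, compressed, round(raw / compressed, 1))
```

Note that `pcm_bytes` takes no argument describing the music itself: a quiet solo passage and a dense orchestral one cost exactly the same. A perceptual codec works within its bit budget by deciding which parts of the signal deserve those bits.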
Temporal masking
In addition to auditory masking, which is
dependent on the relationship between frequencies and their relative volumes,
there's a second sort of masking which also comes into play, based on time
rather than on frequency. The idea behind
temporal masking is that humans also have trouble hearing distinct sounds that
are close to one another in time. For example, if a loud sound and a quiet
sound are played simultaneously, you won't be able to hear the quiet sound.
If, however, there is sufficient delay between the two sounds, you will hear
the second, quieter sound. The key to the success of temporal masking is in
determining (quantifying) the length of time between the two tones at which
the second tone becomes audible, i.e., significant enough to keep it in the
bitstream rather than throwing it away. This distance, or threshold, turns out
to be around five milliseconds when working with pure tones, though it varies
up and down in accordance with different audio passages.
Of course, this process also works in
reverse: you may not hear a quiet tone if it comes directly before a louder
one. Thus premasking and postmasking both occur, and are accounted for in the
algorithm.
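As a rough illustration of premasking and postmasking, the sketch below filters a list of tone events, dropping any tone that starts within a short window before or after a louder one. It is a deliberate simplification: the `Tone` type and function name are hypothetical, the five-millisecond window from the text is applied in both directions (an assumption), any louder tone is treated as a masker regardless of how much louder it is, and a real encoder analyzes frames of samples rather than discrete events.

```python
from dataclasses import dataclass

@dataclass
class Tone:
    onset_ms: float   # when the tone starts
    level_db: float   # relative loudness

# ASSUMPTION: the same ~5 ms window is used before and after the louder tone.
PRE_MASK_MS = 5.0
POST_MASK_MS = 5.0

def audible_tones(tones, pre_ms=PRE_MASK_MS, post_ms=POST_MASK_MS):
    """Return the tones that are not temporally masked by a louder neighbor."""
    kept = []
    for tone in tones:
        masked = False
        for other in tones:
            if other is tone or other.level_db <= tone.level_db:
                continue  # only a louder tone can mask in this toy model
            gap = tone.onset_ms - other.onset_ms
            post_masked = 0 <= gap < post_ms   # quiet tone follows the loud one
            pre_masked = -pre_ms < gap < 0     # quiet tone precedes the loud one
            if post_masked or pre_masked:
                masked = True
                break
        if not masked:
            kept.append(tone)
    return kept

if __name__ == "__main__":
    events = [Tone(0, 0), Tone(2, -20), Tone(20, -20), Tone(48, -20), Tone(50, 0)]
    print([t.onset_ms for t in audible_tones(events)])
```

In this example, the quiet tone at 2 ms is postmasked by the loud tone at 0 ms, and the quiet tone at 48 ms is premasked by the loud tone at 50 ms. The quiet tone at 20 ms is far enough from both to survive.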
NOTE
For more information on
psychoacoustics, read any of the excellent papers on the subject at www.cpl.umn.edu/auditory.htm.