A perceptual space that can explain the robustness of…


MALE SPEAKER: Today, we are
very happy to have Roy Patterson from the Center for
the Neural Basis of Hearing at Cambridge University. Roy has been, in my impression,
a leading light in the hearing research business
for a very long time. We’ve known each other for quite
a while and I could tell you some stories, but
this is not a good time for that, I guess. So we’ll let Roy tell
his stories. Roy, take it away. ROY PATTERSON: Thank you. It’s a delight to be here. I think this microphone
is working. OK, I didn't quite know what
the Google audience would be. I’ve given you a selection
of titles. There is size information
in sound. You can tell the difference
between a child and an adult, and you need to normalize for that
in order to communicate. So that’s the first title. The second is my description
of how the auditory system does this. It produces a scale-shift
covariant Auditory Image. And the auditory image is just
like a visual image. If you open your eyes,
you see an image. Similarly, when you wake
up, you open your ears. And if there’s sound there, your
auditory system makes an image of it automatically. And it’s the processing that
goes from the sound to that initial image that I’m trying
to characterize. And finally, for a speech
community, making speech recognizers family friendly, I’m
just referring to the fact that if you train a recognizer
on a man, it won’t even understand a woman,
let alone a child. And it’s the children that
actually know how to make the technology work. So you really want the child
to train the recognizer so that you can use it. And this is my idea of making
recognizers family friendly. There are a lot of people
involved in this research. So I’m representing quite
a large group of people doing research. These up here are the postdocs
that are currently in my lab, all of them working on this. Two of them have just got good
jobs elsewhere, but we still keep them on the list because
they’re working with us. And we have four students, one
of whom is here, Tom Walters. We also have two well-known
Japanese colleagues who are intimately involved, Hideki
Kawahara, the father of the vocoder STRAIGHT, which you’re
going to hear about, and Toshio Irino, who works with me
on all of the wavelets and Mellin transforming
stuff, size stuff. So you may not know them,
but you may want to. So the overview is that there
is size information in communication sounds,
in speech, music, and animal calls. Humans can extract the content
of the communication without being confused about the
size of the source. And humans can extract the size
information without being confused by the content. Auditory perception is
amazingly robust to variability in communication
sounds, and these size changes change the sound a lot. Machine recognition systems
are not robust to size variability even when it’s a
single speaker in clean speech. So, how does the auditory system do it, and can we capture it in algorithms
and make them useful? So I’m going to start with what
you need to know about the size information in
communication sounds and how people perceive it. Then I’m going to tell you how we’ve developed a little recognizer to equate
machines and humans on a certain task so that we can
demonstrate the robustness of humans and to give us a measure
as to how the machines are doing and when they’re
improving. And then, we say, OK, can this
auditory representation be used to produce better features
for the recognizer? And in the end, I’m
going to tell you that it looks promising. Size information in communication sounds. The sounds that animals use to
communicate at a distance to declare their territories and
to attract mates– the important stuff in
animals’ lives– are typically pulse-resonance
sounds, which means that the animal finds some way of
producing a pulse, typically by banging two hard
parts together. The pulse marks the start
of the communication. The energy goes into the body of
the animal, and that object then resonates. And that resonance is attached
to the back of the pulse, and that resonance tells you about
the structure and the size of the animal. And it's that simple trick that the animals are using.
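As a rough illustration of this pulse-resonance idea (a minimal sketch, not code from the talk; the sample rate, pulse rate, resonance frequency, and decay time below are arbitrary assumptions), a synthetic pulse-resonance sound can be built by exciting a damped resonance with a stream of pulses:

```python
# Minimal sketch of a pulse-resonance sound: a pulse train convolved with a
# damped resonance. All parameter values are illustrative assumptions.
import numpy as np

fs = 16000                       # sample rate (Hz)
pulse_rate = 120.0               # pulses per second ("glottal pulse rate")
resonance_hz = 700.0             # resonance frequency; reflects resonator size
decay_s = 0.01                   # how quickly the resonance dies away

# Pulse train: one unit impulse at the start of each cycle of a half-second call.
n = int(0.5 * fs)
pulses = np.zeros(n)
pulses[::int(fs / pulse_rate)] = 1.0

# The "body" ringing after each pulse: an exponentially damped sinusoid.
t = np.arange(int(5 * decay_s * fs)) / fs
resonance = np.exp(-t / decay_s) * np.sin(2 * np.pi * resonance_hz * t)

# Each pulse carries a copy of the resonance behind it.
sound = np.convolve(pulses, resonance)[:n]
```

In this toy picture, a smaller source corresponds to a higher resonance_hz and a shorter decay_s, and a larger source to the reverse; that is the size dimension the talk keeps returning to.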
So for an animal of interest to us, the human, consider the information in speech sounds. You have vocal cords at the bottom of your throat. They vibrate, interrupting the airflow from the lungs, and produce a stream of pulses. Those pulses excite resonances
in the vocal tract. And the shape of the vocal
tract, the place where you put your tongue, determines
the resonance, and thus, the vowel type. But in addition, vocal tract
length determines the rate at which those resonances run. And this is the new variable
that I need to draw your attention to. So if you have a large person
with a long vocal tract, and then you have a small person
with a short vocal tract, they can actually say vowels with the
same pitch, and they can put the vocal tract in the same
shape so that you hear the large person and the small
person saying the same thing. But you can still tell which
is the big one and which is the small one because the
difference in vocal tract length produces a difference
in the rate at which the resonances behind
the pulse run. So here’s an example. In this slide, there are six
copies of the vowel “ah.” And it’s always the vowel “ah,” and
you can tell that because the resonance after the
pulse is the same shape in all six waves. Over here, the pulses come
at an increasing rate. So over here, we have COMPUTER SOUND: Ah, ah, ah. ROY PATTERSON: Can we turn
that up a touch? Nope, not that way. COMPUTER SOUND: Ah, ah, ah. ROY PATTERSON: So that’s the
pure pitch change, and that’s very simple. I can do that, “ah, ah, ah.”
Over here, notice that the pulses come at the same rate
in all three of these. So the voice pitch is the
same over here, and the vowel is the same. It’s the same shape. But over here, as we go down,
the resonance runs faster. The peaks occur faster, OK? And here, you get COMPUTER SOUND: Ah, ah, ah. ROY PATTERSON: And that is the
sound of a pure vocal tract length change. COMPUTER SOUND: Ah, ah, ah. ROY PATTERSON: I can’t
do that one. COMPUTER SOUND: Ah. Ah. Ah. ROY PATTERSON: That's the dimension we need to go after. Now, animals,
typically, all modern animals communicate with what we call
communication syllables. So animals put out their pulses
in bursts, and each of the pulses carries the resonance
behind it, so you get lots of copies of
the resonance in a pulse-resonance sound. So here’s a human communicating
with pulse-resonance sound. COMPUTER SOUND: Mah, mah. ROY PATTERSON: And
here’s a macaque. COMPUTER SOUND: [BIRD CALL] ROY PATTERSON: And– COMPUTER SOUND: [FROG CALL] ROY PATTERSON: Another very
successful communicator. And this is a fish. COMPUTER SOUND: [FISH CALL] ROY PATTERSON: And I include the
fish because they probably started doing this as much
as 500 million years ago. And they do it with very
different mechanisms, but they’re using the same
trick, pulses with resonances attached. They always seem to be about a
half-second long, as well. So that’s pulse-resonance
communication. So what you need to know about
communication sounds is that in the communication sounds
at the syllable level– I mean, there are
larger issues. There’s syntax and semantics
and prosody. But just down at the level
of the individual sound– the syllable– there are three kinds of
important information in communication. There’s the resonance shape,
and that’s the message. There’s the glottal
pulse rate. That’s the pitch. And there is the resonance
scale, which tells you how big the speaker is. Now, all animals grow, and evolution invariably uses parts of the body that already exist to make the sounds. In our case, it uses the tubes
from the nose and the mouth to the lungs and the stomach. And as you grow up, those tubes
have to get longer to connect you to the
outside world. And so there must be size
information in animal sounds. And this is true
of all animals. And so it’s an important problem
for animals to be able to generalize across sounds
and recognize that a new sound, which is physically
different, is actually just the same message, produced by a source of a smaller size. AUDIENCE: Can you tell how
large a cricket is by the sound it makes with its legs? ROY PATTERSON: Oh,
you are very– so the bush-cricket is the one
animal that has learned to make a sinusoid. And it makes a four kilohertz
sinusoid with a patch of its wing, and it doesn’t communicate
until it’s a full-size adult. So it’s one animal that has very
little size information. Pick another animal. AUDIENCE: I’m really curious
about Myna birds because I don’t think they produce
their sound this way. Rather than hitting a vocal
tract with a bunch of pulses, they’re [INAUDIBLE] ROY PATTERSON: Yeah, they
have two syrinxes. They have one on each side, but
if we just take one at a time, then the syrinx does have
a vibrating source that interrupts the airflow. AUDIENCE: Do they have a
continuous and changing resonance or are they
[INAUDIBLE] AUDIENCE: Please repeat
the question. ROY PATTERSON: Oh, what
about the Myna bird? So the first point is
that birds are much more exotic than us. They have a voice box in
each bronchus. And actually, the songbirds have
a brain that allows them to play one, play the other,
or play them as a pair. So all of that aside, in the
syrinx is a flapping membrane that is producing pulses. And those pulses are exciting
the vocal tract. Now, in the birds, you
have three sections of the vocal tract– just above the syrinx, and then
you have the throat, and then you have the mouth. But in birds, the mouth often
includes the bill. So it’s complicated, but it
is the same kind of sound. AUDIENCE: [INAUDIBLE] me in
conversation once that [INAUDIBLE] wouldn’t work on
Myna bird speech because it’s varying the pitch pretty
continuously instead of varying in the vocal tract
shape [INAUDIBLE]. ROY PATTERSON: I think
the issue is just– there are more variables
there. I’m just giving you three
of the simple ones. And these apply, but then
there are more. We had a student project on it
that I’m trying to remember, but I can’t. So this is true of human
speech, that they use pulse-resonance sounds, and
it has these three kinds of information at the syllable level. It's true of animal calls– mammals, birds, fish, frogs. It’s true of grasshoppers,
actually, and shrimp. It’s true of most musical
instruments. Most musical instruments, when
you look, have a pulse of excitation that resonates in the body of the instrument. So it appears that certainly
in the sustained-toned instruments of the orchestra,
that we have chosen sound makers for which we already have
an analysis mechanism. But that’s another topic. And, of course, engines have
explosive pulses of excitation. And as you know, any good
mechanic will listen to the car for a while before he takes
it apart because he can probably tell a fair
bit about what’s wrong just by listening. These stand in contrast to the
inanimate sounds produced by turbulent air and turbulent
water in the natural environment. Wind in the trees, rain on the
roof, waterfalls, and such. And maybe, 500 million years
ago, animals started using pulse-resonance sounds because
they stood out against the natural background, but that’s
another long story. So that’s all you need to know
about the size information in communication sounds. As you get bigger, your pulse
rate will go down because your vocal cords get longer
and more massive. And as you get bigger, the
resonances will ring slower because they have more
mass, essentially. And this is just a very general
principle in sound production. Now, what about the perception
of communication sounds? I’ve said we’re robust. I just
want to draw your attention to a few experiments that
we’ve put in the literature to prove this. So for what I’m going to present here, we now have, thanks to Kawahara-san and Irino-san, a computer application that can take my recorded
speech, take it apart cycle by cycle, and then change the
glottal pulse rate and change the vocal tract length
and resynthesize. And this allows us to do nice,
controlled experiments where we have very good control of
glottal pulse rate and vocal tract length, while preserving
all of my individual speaker characteristics. If you want to do experiments,
this is the tool. So here, most of the slides
I’m going to show you have glottal pulse rate on the
abscissa and vocal tract length or some other measure of resonator size on the ordinate. And here, the pitch is going up
and the vocal tract length is going down. So I’m going to present you
sets of three speakers who differ in pitch and vocal
tract length. And each one of them is going
to say three syllables, and your task is to say whether the
three speakers are saying the same three syllables. So listen carefully. COMPUTER SOUND: Mo, pee, fa. Mo, pee, fa. Mo, pee, fa. Mo, pee, fa. Mo. pee, fa. Mo, pee, fa. ROY PATTERSON: It’s trivial. The change in size, from a young
boy to a teenager to a large adult, is absolutely
trivial for you. So you may think you’ve learned
this because you have a lot of experience
with children and teenagers and adults. But I have some other speakers
here which have combinations of glottal pulse rate and vocal
tract length which you are very unlikely to
have ever heard. A very short vocal tract with
a low pitch and a very long vocal tract with a high pitch,
same person in the middle. Try this one. COMPUTER SOUND: Wu, be, ha. Wu, be, ha. Wu, be, ha. ROY PATTERSON: Trivial. COMPUTER SOUND: Wu, be, ha. Wu, be, ha. Wu, be, ha. ROY PATTERSON: And we can do
it with a fixed pitch and a changing vocal tract length. COMPUTER SOUND: Re, ku, sa. Re, ku, sa. Re, ku, sa. ROY PATTERSON: So that’s the
size component of a perception that is specific to vocal
tract length. And there’s a component
specific to pitch. COMPUTER SOUND: Na, te, zi. Na, te, zi. Na, te, zi. ROY PATTERSON: However, the
point is, you are amazingly robust to this. So that’s the audio demo. And you’re also very good at
telling what the size of the speaker is. I’ll come back to
the recognition problem in a minute. Or maybe I’ll do it right now. So you’re robust to
changes in size. We took the five vowels– a, e, i, o, and u, spoken by
me, and we used this magic program, STRAIGHT, to scale the
glottal pulse rate all the way from my normal– 110– all the way up to 640 and down
to 10 Hz, below the lower limit of pitch where
you’re just hearing individual cycles. And we scaled the size from my
size to double my size and half my size, which is
an enormous range. And we said to the person on
each trial, here’s a vowel. Which of the five vowels
is the person saying? And they had to choose
one of five– five alternative forced
choice test. Now, let me just point out,
these are the mean data, and this ellipse here shows you
the combinations of vocal tract length and pitch of 99%
of the people that you have ever heard. Their combinations
are all here. So small children are here, and
women are here, and men are here, and very large
men are here. So they’re all in there. So when we go beyond that, you
really have not heard speakers of this type before. So how do people do? Well, this is just a very
simple task for them. The 50%-correct threshold– d-prime, a statistical measure of whether you can do this reliably– is this black line, which is barely in the picture. And for all of the yellow and
orange and red, you can do this task way above chance. And I think this dark
color is about 95%. And you can do it all the
way down to below the lower limit of pitch. You’re just amazingly robust
to changes in size. So that’s to say that you can
get the information about the message independent of the
size of the source. At the same time, you’re
peeling off the size information in a useful way– AUDIENCE: [INAUDIBLE] ROY PATTERSON: Sorry? AUDIENCE: [INAUDIBLE] ROY PATTERSON: OK, so we took
the same set of vowels, and they vary in glottal pulse rate
and vocal tract length. And we played single vowels to
people and said, how big is the speaker, from very small,
one, to very large, seven? People are consistent. If you give them a low glottal
pulse rate and a long vocal tract, they say that’s
a very large person. And if you give them a high
glottal pulse rate and a short vocal tract, they reliably say
that’s a very small person. The two dimensions both affect
your perception of size in that both of these come down,
and they interact– the surface is not
simply a plane. So as you go up in pitch, it’s
fairly linear in log glottal pulse rate. But for vocal tract length,
in the range that you have experience with, it’s
fairly flat, and then it goes quite steep. When you go beyond the normal range, suddenly people start to sound really quite small quite quickly. But it's a stable source. And in another experiment, to make it more speech-like, we produced a syllable database. It has consonant-vowel syllables and vowel-consonant syllables, and the consonants are from the three main groups– sonorants, stops,
and fricatives– six of each– and the five cardinal vowels,
a, e, i, o, and u. And that gives us 180 syllables
plus five vowels. And we use that to do some
discrimination experiments, which I’ll tell you
about in a minute. So if you take normal recorded
syllables and just play a sequence of them, it will
sound irregular. And so these syllables have
also been what we call p-centered, basically focused on
the vowel onset, which when you play them in sequences,
makes the rhythm sound even. So then we can draw randomly
from this database and get sequences like, COMPUTER SOUND: Mi, en,
ka, it, so, us. Mi, en, ka, it, so, us. ROY PATTERSON: So STRAIGHT
produces very high-quality vocoding. And we run experiments
like this. So we do experiments in
psychophysics, and we have to be very careful that there
aren’t any extraneous cues that you can use
to do the task. So if we just took two vowels
and said, which of these speakers is larger? As a matter of fact, if you’re
using a log frequency scale and you have three formants,
as the person gets larger, that spectrum moves as a unit
towards the origin. So if you could just glom on to
one of the spectral peaks and compare that spectral peak
between two intervals, you’d be able to do this task. We’re going to present two
intervals and you have to say which speaker is larger. So we make the spectral cues– we scramble the spectral
cues very badly. And we do that by choosing a
selection of syllables at random from that database
for them– and we present those
in interval one– and we give them a picture
contour that is reasonable for speech. And the pitch is also moving
the spectral components all over the place. And then, in the second
interval, we select four syllables again at random. And we give it a different
pitch contour. So there really isn’t any pitch
cue or spectral cue that you could carry and compare
between these two intervals. However, in the first interval,
all of the syllables were constructed with the
same vocal tract length. And in the second interval,
all of the syllables were constructed with that vocal
tract length plus a bit or minus a bit. And you have to say which
one was larger. So, ready? COMPUTER SOUND: Ma,
et, ku, am. Se, om, te, wa. Ma, et, ku, am. Se, om, te, wa. ROY PATTERSON: Which interval
had the smaller speaker? AUDIENCE: [INAUDIBLE] ROY PATTERSON: One person got it
wrong, but I won’t say who. COMPUTER SOUND: Ma,
et, ku, am. Se, om, te, wa. ROY PATTERSON: OK, that’s about
a 10% change in vocal tract length, and after two
minutes training, most people are close to 100% on that. It’s a very natural thing. You just bring people in and
say, the speakers are a different size. Tell me which one’s smaller. And we do that experiment, and
we do it for five different starting points– large male,
small male, small child, castrato and dwarf,
the same ones I’ve been using all along. So now, these are five
psychometric functions for those positions. And you start off here with a
certain vocal tract length, and you compare it to vocal
tract lengths that are longer. And as soon as they’re a bit
longer, then you go to 100% on which one’s smaller. And if we reduce the vocal tract
size, then you go to 0% on the smaller one
because you’re choosing the larger one– you’re inverting the choice. Anyway, the point is that as
you change the vocal tract length away from the starting
point, people rapidly can do this task reliably. This is true for all five
of these speakers. It doesn’t seem to matter
where you start. And the slope of the
psychometric function is about the same in all cases. So we can create a statistic
that psychologists call the just noticeable difference, that
is, the difference you have to make on a dimension
for it to be reliably detectable. And we do that by saying, how
much vocal tract length do you need to raise the percent
correct from 50 to 75? And that’s statistically well
defined as a d-prime of one. Well, it's a little more than a d-prime of one.
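The JND computation just described can be sketched as fitting a psychometric function and reading off the 75%-correct point (the data points below are invented for illustration; they are not the experiment's numbers):

```python
# Sketch: fit a cumulative-Gaussian psychometric function rising from chance
# (50%) to 100%, and take the 75%-correct point as the JND.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

vtl_change = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 12.0])       # % change in VTL
p_correct  = np.array([0.50, 0.62, 0.74, 0.86, 0.95, 0.99])  # invented data

def psychometric(x, mu, sigma):
    # 0.5 + 0.5 * Phi((x - mu) / sigma): equals 0.75 exactly at x = mu.
    return 0.5 + 0.5 * norm.cdf(x, loc=mu, scale=sigma)

(mu, sigma), _ = curve_fit(psychometric, vtl_change, p_correct, p0=[4.0, 3.0])
print(f"JND is roughly a {mu:.1f}% change in vocal tract length")
```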
So if we do that, what we find– and now I’m showing you the
data for all six of the different kinds of consonant,
vowel, and vowel-consonant syllables separately, but it
doesn’t really matter. But that’s just why there are
six points on each of these. And this fixed line is 5% in
each of these data plots. And what you can see is that for
the small male, the large male, and the castrato with
the longer vocal tracts– these three– you need about a 4% change in
vocal tract length in order to reliably tell whether the
speaker is larger or smaller. And for the shorter vocal
tracts, you need 6% or 7%. But on average, you need a 5%
change in the size of the resonators. I should have told you, vocal
tract length is very highly correlated with your height. So a 5% change in height will
produce a 5% change in vocal tract length, and it’s the
acoustical scale of the vocal tract that you’re detecting. Now, why is that important? So this supports the idea that
you have access to the acoustic scale dimension in the
brain, and you treat it like a dimension. The just noticeable difference
for loudness of a sound is 10% or 11%. The just noticeable difference
for the brightness of a diffuse light is 14%. The just noticeable difference
for the duration of a sound is 10% or 11%. So a 5% JND is a very
good JND for perception. It’s not as good as pitch, and
it’s not as good as visual acuity, but it’s a
very good JND. You can also do these
experiments with musical instruments. We published a paper in JASA
on the horns, strings, woodwinds and voice, but I don’t
really have time for that today. So that’s the first point. So I’ve told you about the form
of the size information in communication sounds. And I’ve told you about
the perception. We’re amazingly robust. We seem
to be able to take these sounds apart. And why is that surprising? Well, let’s go back to these
sounds and look at their magnitude spectra. So if I take the Fourier
transform– a windowed Fourier transform– and integrate it and produce
a spectrogram– these are slices of
the spectrogram, or magnitude spectra. And when you change
the pitch– COMPUTER SOUND: Ah, ah, ah. ROY PATTERSON: –you change
all of the frequency components and all of the amplitudes. You just move them all
over the place. And when you change the
vocal tract length– COMPUTER SOUND: Ah, ah, ah. ROY PATTERSON: –now, it is,
actually, the case that if you really have a fixed pitch when
you do that, it would be the same frequency components in
your spectrum because they would all be harmonics
of one fundamental. But all of the amplitudes change. And normally, you have a
combination of these two. So the spectral components are
just moving around in a dizzying way. But there’s something about the
pattern of them that the system can detect. So really, they’re being scaled,
but scaled in two dimensions– glottal pulse rate
and vocal tract length. And recognizers have trouble with this.
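To see the two kinds of scaling in a spectrum, one can synthesize two toy vowels that share a pulse rate but differ in resonance frequencies (a sketch with made-up parameters, in the spirit of the pulse-resonance example earlier):

```python
# Sketch: magnitude spectra of two synthetic vowels. Changing the pulse rate
# moves the harmonic lines; scaling the resonances moves the spectral envelope.
import numpy as np

fs, dur = 16000, 0.5

def toy_vowel(pulse_rate, formants, decay_s=0.01):
    n = int(dur * fs)
    pulses = np.zeros(n)
    pulses[::int(fs / pulse_rate)] = 1.0
    t = np.arange(int(5 * decay_s * fs)) / fs
    body = sum(np.exp(-t / decay_s) * np.sin(2 * np.pi * f * t) for f in formants)
    return np.convolve(pulses, body)[:n]

large   = toy_vowel(110.0, [700.0, 1220.0])     # longer vocal tract
smaller = toy_vowel(110.0, [840.0, 1464.0])     # same pitch, formants x 1.2

window = np.hanning(int(dur * fs))
spec_large   = np.abs(np.fft.rfft(large * window))
spec_smaller = np.abs(np.fft.rfft(smaller * window))
# On a log-frequency axis, the second envelope is the first shifted up by
# log(1.2), while the harmonic spacing (the pitch) is unchanged.
```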
Now, to illustrate this– humans find it hard to believe. Humans believe in recognizers
and they think if you train it on one person, they’ll
be able to recognize another because we can. But we needed to test that. So we took the syllable
database and we did an experiment with humans. That is, we took the syllables and we produced people with all of these different combinations of glottal pulse rate and vocal tract length– 56 people in all. And we trained people on the one
at the very center of this blue spot and eight others
in this blue spot. But basically, we trained them
on people of one size. And training for the humans
means learning to use this GUI where we would present them a
syllable and they had to tell us which syllable had been presented. And they have to learn the funny
way in which we’re going to use the syllables “ma, me, mi,
mo, mu,” and they have to learn to find all
the positions. And that’s basically what
they’re doing, is learning to use the GUI on this. And then, without any further
ado or any further training, this is where those voices
are in the normal space. So here’s 99% of the men you’ve
ever heard, 99% of the women, and 99% of the children. These are the voices that we were using. And we just, all of a sudden,
started presenting– instead of the person at the
center, we presented people at the ends of the spokes. So suddenly, the size of
the person is changing dramatically on every
trial at random. And it makes almost no
difference to humans. That doesn’t surprise you, but
I’m saying, this is our measure of robustness. You can train on one point,
test on any point in the space, and it hardly makes
any difference. It is the case that at the ends of the spokes– three and seven, which are
the most unusual voices– your performance is
a touch worse. But that really is a small
effect, and this is 20 observers with a very large
number of trials. All right, keep that in mind. That doesn’t surprise you,
but maybe it should. We took an HMM recognizer– a hidden Markov model
recognizer– and you need about three
emitting stages for recognition of syllables. And we used MFCC features. So this is pretty much an
industry standard kind of recognizer, set up to work
on our syllable database. Clean speech, single syllables.
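A minimal analogue of that baseline can be sketched with off-the-shelf tools (this is not the authors' actual recognizer or toolkit; training_files_by_syllable is a hypothetical mapping from syllable labels to audio files):

```python
# Sketch of an "industry standard" baseline: 13 MFCCs per frame and one
# 3-state Gaussian HMM per syllable, classification by maximum log-likelihood.
import numpy as np
import librosa
from hmmlearn.hmm import GaussianHMM

def mfcc_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (frames, coeffs)

def train_models(training_files_by_syllable):
    models = {}
    for syllable, files in training_files_by_syllable.items():
        frames = np.vstack([mfcc_features(f) for f in files])
        hmm = GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
        hmm.fit(frames)            # simplification: utterances concatenated
        models[syllable] = hmm
    return models

def recognize(models, wav_path):
    feats = mfcc_features(wav_path)
    return max(models, key=lambda s: models[s].score(feats))
```

Training such a model on voices from one small region of the glottal-pulse-rate and vocal-tract-length plane and then testing it elsewhere is the extrapolation test described next.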
And we trained the recognizer on nine voices in this blue region here. And when the recognizer was
trained so that it was producing– this is error rate– so that it was producing an
error rate of 0 on the syllables that it had been
trained on, we then started to change the vocal tract length of the test voices. And immediately, as you go away
in vocal tract length, the error rate rises rapidly to a disastrous level. So recognizers are not robust
to extrapolation. If you take the same recognizer
and train it on half of those voices from all
over that region and then test it, it does very well. It does 95% if it has seen
all of the data that it has to work on. So it can interpolate
between voices that it has learned from. It can get the one in the middle
between them, but it isn’t any good at
extrapolation. And as you saw in our
recognition experiments, humans can extrapolate
way beyond anything they’ve ever heard. So, the auditory system appears
to adapt to the pulse rate and the acoustic scale
of communication sounds automatically, as if we
have a transform. So what I’m proposing is that
you have a normalizing transform in the system before
you try and do the recognition. And the processing appears
to produce a speaker-size invariant representation of
the message, which is very convenient. So how might the auditory
system do this? To explain this, what I’m
going to do is take four sounds and take them through
several stages of the auditory system to see if we can get them
all to be the same in the sense of the message being the
same. And these are all two-formant vowels, and in all
cases, the formants are just damped resonances, and
they are all scaled versions of one another. So it’s exactly the same object
in different sizes. And I’m going to start with a
long vocal tract and a low pitch here. And then I’ve got one where you
have a long vocal tract still, but you have
a higher pitch– in fact, an octave higher– and then a short vocal tract
and a low pitch. So the pitch is the pulse rate
and the vocal tract length is the resonance rate. So this one rings faster
than this one. And then finally, the
combination of short vocal tract and high pitch. So I’ve got a two-by-two in
which both of these crucial dimensions are variant. Now, I take those four sounds
and I put them into an auditory model. This is my favorite gammatone
auditory filter bank, but you can use almost any filter
bank to do this or any wavelet transform. The auditory system is really
performing a wavelet transform in which the kernel scales with the carrier.
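A gammatone filterbank of the kind just mentioned can be sketched directly from its standard definition (a fourth-order gammatone with the Glasberg and Moore ERB bandwidths; the channel count and frequency range below are arbitrary choices, not the lab's exact settings):

```python
# Sketch of a gammatone filterbank: one channel per centre frequency, each an
# impulse response whose envelope scales with its carrier, as in a wavelet.
import numpy as np

def erb(fc):
    # Equivalent rectangular bandwidth of the auditory filter at fc (Hz).
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
    t = np.arange(int(duration * fs)) / fs
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) \
        * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def filterbank(signal, fs, centre_freqs):
    # Rows are frequency channels; this is the raw basis for a neural activity
    # pattern (a fuller auditory model adds compression and rectification).
    return np.stack([np.convolve(signal, gammatone_ir(fc, fs))[:len(signal)]
                     for fc in centre_freqs])

centre_freqs = np.geomspace(100.0, 6000.0, 32)   # log-spaced channels
```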
So down here, this triangular shape here is what you get when a glottal pulse strikes
a single resonator. It rings for a while. And here is the second
formant, and it rings for a while. It’s at a higher frequency. It happens to ring for exactly
the same number of cycles, but since you’re at a higher
frequency, the cycles go by faster, so it dies
away quicker. And this is the form
of a size change. In this representation, the
object moves up and shrinks. So, if we shorten the vocal
tract and we’re on a logarithmic frequency scale
here– logarithmic frequency or acoustic scale– then the shorter vocal tract
transforms both of the formants proportionately, so
they move up as a pair on this scale, and they both shrink. Again, they have the same number
of cycles, but you’re using a different acoustic scale
to represent them, so their shape changes. They are not size invariant
on this representation. So what could you
do to fix this? Well, the obvious thing is to
just change the frequency dimension, so that you pull the
cycles out, so that you’ve got the same number of
cycles in each case. I just want to take this one
and pull it out to there. And for these up here, I say I
want to pull them to the same point because then, I know
they’ve got the same number of cycles in them. And the question is, can
you do that in general? And the answer is, yes,
mathematically you can. What you have to do is take
your time dimension and in each channel, separately,
multiply time by the center frequency of the channel. And frequency is increasing
as we go up. So as we go upward,
we’re steadily expanding the time dimension. And if we do that to this
representation, we get this representation, which is an expanding neural activity pattern.
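That per-channel expansion can be written down compactly: re-express each channel's time axis in cycles of its own centre frequency (time multiplied by fc) and resample onto a common cycles axis (an illustrative sketch, not the lab's implementation):

```python
# Sketch: expand the time dimension of a neural activity pattern channel by
# channel, so that scaled copies of the same resonance line up.
import numpy as np

def expand_time(nap, fs, centre_freqs, n_cycles=20, samples_per_cycle=16):
    """nap: (channels, time) array, e.g. the filterbank output sketched above."""
    t = np.arange(nap.shape[1]) / fs
    cycle_axis = np.linspace(0.0, n_cycles, n_cycles * samples_per_cycle)
    out = np.empty((nap.shape[0], cycle_axis.size))
    for i, fc in enumerate(centre_freqs):
        cycles = t * fc                     # time measured in periods of fc
        out[i] = np.interp(cycle_axis, cycles, nap[i])
    return out
```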
And I probably should have said this is the kind of representation we think you
have at the output of the cochlea, flowing up the
auditory nerve to the cochlear nucleus. We call it a neural
activity pattern. And here’s an expanded neural
activity pattern, as if the auditory system treated time
differently in each channel. And you can see that that has
successfully converted these two resonances, which we know
are the same kind, into the same shape in this place. And it’s also converted the
ones in the shorter vocal tract into the same pattern. And pitch doesn’t matter to this
representation, because this is really the ringing
of the vocal tract. And it doesn’t matter what the
period of the vocal cords is, unless the glottal pulse period gets to be shorter than the ringing of the formant. And actually, in speech,
that does occur. That’s a separate problem, but
I’m not trying to hide it. So this is very nice. We now have a representation in
which the message is the same for all four of our
different sounds. The only problem is, this
representation is not time-shift covariant. That is, when we get to the
second cycle, we find that our expansion has messed it up. First of all, it’s put the
glottal pulse line– it’s shifted the glottal
pulse line over. And secondly, this is a log– so the dimension is the log of
the number of cycles of the impulse response. And because of the log,
it’s compressed it in the second cycle. So successive cycles don’t
appear in the same form, and that’s not very good. But what we actually think is
that the auditory system restarts time on each
new glottal pulse. So when it gets here, it really
flips back and puts the next cycle– it’s really measuring time
from each glottal pulse. And as long as it does that,
then it will find all the glottal cycles. There’s not much change between
successive cycles and it’s going to plot them
on top of each other. And we call that strobe
temporal integration. And you have to do that a little
sooner if the pitch is higher, but it’s not
really a problem. So if you do that, you then get a representation where the cycles are all falling on top of each other as long as the sound is stationary.
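A much-simplified sketch of strobe temporal integration: treat large local maxima in each channel as strobe points that restart time, and average the segments that follow them (the real strobe criteria and image decay in the auditory image model are more elaborate than this):

```python
# Sketch: strobed temporal integration for one channel. Segments following
# each detected "pulse" are superimposed, so a stationary sound gives a
# stable image while the raw activity keeps streaming past.
import numpy as np

def strobe_integrate(channel, frame_len, threshold_ratio=0.5):
    threshold = threshold_ratio * channel.max()
    image = np.zeros(frame_len)
    count = 0
    for i in range(1, len(channel) - frame_len):
        is_peak = channel[i] > channel[i - 1] and channel[i] >= channel[i + 1]
        if is_peak and channel[i] > threshold:
            image += channel[i:i + frame_len]   # restart time at this pulse
            count += 1
    return image / max(count, 1)
```

Applying this to every channel of the expanded pattern gives one frame of the stabilized image described next.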
And this we call the scale-shift covariant Auditory Image. It’s a nice, stable image. It’s an image in which change
occurs if you hear a change, and change does not occur if you
don’t hear a change, which is not true of neural
activity patterns. So in this image,
what happens? Well, you have your two
formants, and they are now the same size. And you can see that it’s
the same formant. And if the size of the person
changes, this pattern just moves vertically as a block. So it’s scale shift covariant
because in the representation, you still have the message and
you have the size information. And they are both there– they covary. So you have both kinds of
information still in there. But now you’ve got them in a
nice, orthogonal form because that pattern just moves
as a block when you change the size. If you change the
pitch, nothing happens to the message. It just stays where it was. What happens when you change
the pitch is that the terminating line for the
glottal period changes. So actually, if you have a
noise, then you can have activity throughout
this entire image. As soon as the sound becomes
periodic in the way that speech does, with glottal
pulses, then you can no longer have information down here. So when a periodic sound comes
on, suddenly, you lose everything below this diagonal,
and the diagonal tells you the pitch. So now you have the
three elements– vocal tract length, glottal
pulse rate, and the message– separated and represented
in an orthogonal form. And this, I think– and a number of people do– think that this representation
of what you hear is a better representation because here,
the message is “ah,” and it doesn’t change. The fact that the child chooses
to say it and an adult chooses to say it, that
doesn’t matter. It’s the same message. And whether they’re a child or
an adult is determined by where the pattern appears
and where this terminating line is. So, can we make use of this? So you remember that we trained
the HMM recognizer on a syllable database of one
speaker, and when we changed the size, it had a problem. What happens if we take features
from the scale-shift covariant Auditory Image? So if we make spectral
profiles– just add across the auditory image– you get this green line
in here in behind. So to that, knowing that what
we were looking for is three formants, we fitted three
Gaussians to the spectra as they came along. And we used those features
instead of MFCCs for the recognizer. And we did one more thing. We know that this set of three
formants will move back and forth when size changes. So we did what speech people
from Kajiyama and Fant forward have told us– use formant ratios. It’s the formant ratio on a
log scale that is fixed. So we gave it the differences of the Gaussians. We gave it the heights. And in total, this gives you somewhat fewer features than the MFCC features that you would normally use, like, about a dozen.
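A sketch of that feature extraction (illustrative, not the authors' code): fit a sum of three Gaussians to the spectral profile on a log-frequency axis, then keep the differences of the means, which behave like formant ratios, plus the heights:

```python
# Sketch: three-Gaussian fit to an auditory-image spectral profile, returning
# size-tolerant features (mean differences on a log axis, plus heights).
import numpy as np
from scipy.optimize import curve_fit

def three_gaussians(x, a1, m1, s1, a2, m2, s2, a3, m3, s3):
    g = lambda a, m, s: a * np.exp(-0.5 * ((x - m) / s) ** 2)
    return g(a1, m1, s1) + g(a2, m2, s2) + g(a3, m3, s3)

def formant_features(profile, log_freqs):
    """profile: spectral profile of one image frame; log_freqs: log centre freqs."""
    k = len(log_freqs)
    p0 = [profile.max(), log_freqs[k // 4], 0.2,
          profile.max(), log_freqs[k // 2], 0.2,
          profile.max(), log_freqs[3 * k // 4], 0.2]
    params, _ = curve_fit(three_gaussians, log_freqs, profile, p0=p0, maxfev=5000)
    amps, means = params[0::3], params[1::3]
    order = np.argsort(means)
    # Differences of means on a log axis stay fixed when the whole pattern
    # shifts with speaker size; that is the formant-ratio idea.
    return np.concatenate([np.diff(means[order]), amps[order]])
```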
And in this case, when we take the same recognizer operating on the same database and have
it work on a mixture of Gaussian auditory image
profiles, then when you train it on this speaker here, and
then you change the vocal tract length, the error rate
does go up a little as you move away from the training
vocal tract length, but it stays at a reasonable rate
in comparison to going up to 100% error. So we think that these features
are more invariant, and they captured the
information in a more useful way. And from the auditory point of
view, we argue that it’s a better representation
of how you hear. So, in summary, there’s
size information in communication sounds. And I’ve told you about
the perception of communication sounds. Humans are very robust, machines
are not, when you develop a measure that allows
you to compare them. And I’ve shown you how we can
take auditory models and extend them to produce a
scale-shift covariant representation. And that does, indeed, help
recognition in this environment. So, to finish off, the databases for recognizers currently have thousands of hours of information from many, many speakers. And they take thousands of hours
to train because there’s so much information in the
database to make them robust. And we think the problem there is that you're actually having to find enough
different speakers so that the recognizer can hear every
syllable you want to recognize with every different combination
of glottal pulse rate and vocal tract length. And we don’t think the auditory system has to do that. And this is the way
to handle it. Thank you. AUDIENCE: How would you expect
the performance of a conventional recognizer like
that to be benchmarked against if it was trained [INAUDIBLE]
single voice, and [INAUDIBLE] parameter method, using extended
[INAUDIBLE] that way, rather than using [INAUDIBLE] ROY PATTERSON: Exactly. The answer is, we don’t know. The prediction is obviously– I’m sorry. Repeat the question. What would happen in
your database– you make a database of a large
number of speakers because you want your recognizer to be
speaker independent. What if you took all those
speakers and then scaled their speech to a large number of
different combinations of glottal pulse rate and
vocal tract length? And we think that’s a
great idea, and we haven’t yet done it. You’re actually seeing
our research up to about a month ago. So that is exactly what we want
to do, and we’ve been looking for some collaborators
to do it with. So our prediction has to be that you can reduce the number of speakers in your database and get higher recognition at the same time. And it should take less time to train your recognizer.
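The augmentation idea being discussed could be sketched as a loop over a grid of scale factors; scale_gpr_vtl below is a hypothetical stand-in for a STRAIGHT-style analysis/resynthesis step (no such one-call API is implied by the talk), and the grid values are arbitrary:

```python
# Sketch of training-data augmentation over the glottal-pulse-rate /
# vocal-tract-length plane. The vocoder call is a hypothetical placeholder.
import itertools

def scale_gpr_vtl(signal, sample_rate, gpr_factor, vtl_factor):
    """Hypothetical: resynthesize `signal` with its glottal pulse rate scaled
    by gpr_factor and its vocal tract length scaled by vtl_factor."""
    raise NotImplementedError("plug in a vocoder such as STRAIGHT here")

def augment(signal, sample_rate):
    gpr_factors = [0.5, 0.71, 1.0, 1.41, 2.0]    # about +/- an octave of pitch
    vtl_factors = [0.67, 0.8, 1.0, 1.25, 1.5]    # child-sized to very large
    for g, v in itertools.product(gpr_factors, vtl_factors):
        yield (g, v), scale_gpr_vtl(signal, sample_rate, g, v)
```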
We have a strong hypothesis, and we’re dying to know whether it’s right. AUDIENCE: So your representation
that you did recognition on essentially
comes down to detecting formants and using their
ratios and a few other parameters. Do you need your auditory image
as a way to get to that, or do other kinds of
formant detection just not do as well? ROY PATTERSON: And the answer
is, we don’t yet know. Let’s see if I can find– so this was a quick thing to
do that everyone would understand, was to
take the spectral profiles of this image. You know what I want to do. I’ve got another slide
here somewhere. Or have I? So these are those same
scale-shift covariant Auditory Images tarted up to publish
in a trendy journal. But these are the same patterns
you saw before with the same terminator. The thing is, on this slide,
I’ve actually plotted the profiles so that you can see,
these are the spectral ones that we used to find formants. Down here is a temporal
profile. The temporal profile is actually scale-shift invariant. And so, where we expect to make
some more progress is to use this one to constrain
the syllable search. It appears to contain
information like the difference between front vowels and back vowels.
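The two profiles being described are just the marginals of an auditory-image frame (a sketch of the bookkeeping, with the image assumed to be a channels-by-time-interval array):

```python
# Sketch: spectral and temporal profiles of one auditory-image frame.
import numpy as np

def profiles(auditory_image):
    """auditory_image: (frequency_channels, time_interval) array."""
    spectral_profile = auditory_image.sum(axis=1)   # formant pattern; shifts
                                                    # as a block with size
    temporal_profile = auditory_image.sum(axis=0)   # timing structure; largely
                                                    # unchanged by a size shift
    return spectral_profile, temporal_profile
```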
So, the answer to your question– do you need to use the two-dimensional image? We haven't proven it yet, but we
think that the information in this temporal profile, which
you don’t have in a spectrogram, will
also be useful. We have to figure out how to
bring this information in the temporal profile together with
the information in the spectral profile. But the answer to the question
is, we don’t yet know. AUDIENCE: I was going to ask if
these slides were online. ROY PATTERSON: They will be. We can leave you a copy to go
with the talk, if you’d like. Thank you very much.
