The following article has been re-posted, as-is, from the now defunct SubjectiveMachine blog. First published Oct 7, 2011.
I’ve spent yesterday and today exploring a relatively new idea.
My PhD research, which I have just completed, focuses on extracting “meaning” from language that is in digital written form (e.g. text in web pages). However, the ideas and techniques I’ve been applying have the potential to work with all sorts of “sensory” data. That fact, along with the problem of translating these things into practical (i.e. fast, scalable) software, and exploring the resultant applications, will be the focal points of my fellowship at the University of Leeds over the coming year.
Just recently I’ve been considering the problem of spoken language. If we are to build agents that operate intelligently in the real world (what are often referred to as “embodied” artificial intelligences – but let’s just be done with it and call them robots), then spoken language is something we would like them to be able to deal with.
Audio as presented by a microphone or an eardrum is unsuitable for many of the things that we might want to do with it. These devices work by sensing sound pressure levels. The data that comes out of them is therefore a measurement of amplitude (informally, the volume) that changes with time. In communication, what is generally more important than this amplitude is the rate at which it changes, or the frequency: different vowel sounds, for instance, tend to be identified by a unique combination of just two or three “formant” frequencies; and the tone and timbre that identify the speaker and the emotional content are probably also defined at least partly in terms of frequency (“probably” means I’ve not done enough background research to know what I’m talking about).
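To make that concrete, here’s a tiny sketch in Python/NumPy (the sample rate and the two “formant-like” frequencies are purely illustrative values picked for the example, not measurements of any real vowel). All the microphone hands us is a sequence of amplitude samples; the frequencies that carry the information are only implicit in how those samples wiggle.

```python
import numpy as np

fs = 16000                          # sample rate in Hz (a common choice for speech)
t = np.arange(0, 0.5, 1.0 / fs)     # half a second of sample times

# A crude "vowel-like" signal: two illustrative frequencies standing in for
# formants, wrapped in a slowly varying loudness envelope.
f1, f2 = 700.0, 1200.0
envelope = np.hanning(len(t))
x = envelope * (np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t))

# All the microphone (or eardrum) reports is this array of amplitudes over time;
# f1 and f2 appear nowhere in it directly.
print(len(x), "samples; first few amplitudes:", np.round(x[:5], 4))
```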
The standard apparatus for identifying the frequencies in a sound signal is the Fourier Transform. It works by looking for the presence of pure tones (sine waves) of every frequency in the sound signal. The result is simply a report of the average amplitude (and optionally, phase) of each frequency present, without any specific reference to time. Surprisingly this representation is sufficient to uniquely describe and to reproduce the original sound (consider that frequency does – by definition – describe how the sound changes over time; so while the high-frequency components are perceived as tones, the very low frequency components correspond more to the envelope or shape of the sound that we hear).
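As a rough sketch of that (NumPy again, on a made-up two-tone test signal rather than real speech): one transform over the whole signal tells us which frequencies are present, and inverting it gives the original samples straight back.

```python
import numpy as np

fs = 16000                                    # sample rate (Hz), assumed for the example
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# One Fourier Transform over the whole signal: an amplitude (and phase)
# for every frequency, but no reference to time.
spectrum = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
amplitude = np.abs(spectrum)

# The strongest components sit at the frequencies the signal was built from.
print("dominant frequencies (Hz):", sorted(freqs[np.argsort(amplitude)[-2:]]))

# And the representation is lossless: inverting it recovers the original samples.
reconstructed = np.fft.irfft(spectrum, n=len(x))
print("max reconstruction error:", np.max(np.abs(reconstructed - x)))
```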
While this is quite a neat trick (its realisation made Joseph Fourier a very famous chap), it is not ideal for analysing non-stationary signals such as speech. This is because during communication we want to analyse the frequency content of speech as we hear it; we don’t want to collect it all first, apply a Fourier Transform, and then be left with the task of trying to figure out which frequency components correspond to which words and moments in the speech.
The standard solution to this problem is to instead apply a Short-Time Fourier Transform (STFT). This involves breaking the sound up into chunks (or windows) of some short duration, and applying a Fourier Transform to each chunk to deduce its frequency components. The result is a spectrogram which shows how the frequency content of the signal changes with time.
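A minimal sketch of the idea (again in Python/NumPy, with an artificial signal that switches pitch halfway through, and a 25 ms window chosen purely for illustration):

```python
import numpy as np

fs = 16000                                   # sample rate (Hz), assumed for the example
t = np.arange(0, 1.0, 1.0 / fs)
# A toy non-stationary signal: 440 Hz in the first half-second, 880 Hz in the second.
x = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t))

window_len = int(0.025 * fs)                 # 25 ms chunks
window = np.hanning(window_len)
n_windows = len(x) // window_len

# Fourier-transform each chunk in turn; stacking the results gives a spectrogram
# whose rows are time windows and whose columns are frequency bins.
spectrogram = np.array([
    np.abs(np.fft.rfft(window * x[i * window_len:(i + 1) * window_len]))
    for i in range(n_windows)
])
freqs = np.fft.rfftfreq(window_len, d=1.0 / fs)

print(spectrogram.shape)                                  # (n_windows, n_freq_bins)
print("dominant frequency early on:", freqs[spectrogram[1].argmax()])
print("dominant frequency at the end:", freqs[spectrogram[-2].argmax()])
```

Libraries such as SciPy ship ready-made spectrogram routines; the loop above just spells out the chunking explicitly.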
This is a fairly obvious and easily-implemented solution, but there are a few problems with it. Coincidentally, these are precisely some of the problems that I attempted to address with my Bachelor’s and Master’s dissertations, and with a considerable part of my PhD thesis. In those cases, however, I was not considering sound signals, but habitat distributions in Biogeography, and word distributions in Natural Language. If there’s one thing worth taking from this very first blog entry and adding to a list of general observations about this world, it’s that the same things tend to crop up all over the place.
In order to perform an STFT, we have to decide upon some sensible window length. This, unfortunately, involves a trade-off. Choosing to look only at very short chunks means that we throw away information about lower frequencies, because our chunks are simply not long enough for such frequencies to complete a cycle or for us to observe them. Choosing chunks that are too long means that we will miss rapid changes in the frequency content which describe the “shape” of the sound. In short, there is a trade-off between temporal resolution and frequency resolution. In speech recognition the best compromise is found with windows of around 25 milliseconds (see this 2010 paper by Kuldip Paliwal et al.). This is not to say that the situation cannot be significantly improved, though. In an ideal world we’d be able to have our cake and eat it: we’d get the temporal resolution of the time-domain signal that we started with, but the convenience of the frequency-domain signal that is given by a Fourier analysis (i.e. with all the nice frequency information that it pulls out and presents for us).
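To put some numbers on that trade-off (the window lengths here are just illustrative choices): for a window of N samples at sample rate fs, the Fourier Transform gives frequency bins spaced fs / N apart, while each column of the spectrogram averages over N / fs seconds. Shrinking one necessarily grows the other.

```python
fs = 16000                                   # sample rate (Hz), assumed for the example

# The STFT trade-off in one loop: shorter windows give finer time steps but
# coarser frequency bins, and vice versa.
for window_ms in (5, 25, 100):
    n = int(window_ms * fs / 1000)           # window length in samples
    print(f"{window_ms:>4} ms window ({n:5d} samples): "
          f"frequency resolution {fs / n:6.1f} Hz, "
          f"time resolution {1000 * n / fs:6.1f} ms")
```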
Improving upon the situation is precisely what I am poking around with right now. If my primary hunch is correct then there is a way we can do this which does not involve defining an arbitrary window size, and consequently involves no trade-off. If my secondary hunch is correct, we might actually be able to do this considerably faster than by applying the standard windowed STFT approach. We might also get some other qualities thrown into the bargain.
I will post up any findings as and when they occur.