Friday, March 13, 2009

Neural Networks in Speech Recognition

Speech is a natural mode of communication for people, but we rarely realize how complex a phenomenon it is. Speech varies widely in accent, pronunciation, articulation, volume, and speed. During transmission, these already irregular patterns can be distorted further by background noise and echoes, as well as by the electrical characteristics of the transmission equipment. All of this makes speech recognition a very complex problem. At its core, speech recognition is a pattern recognition problem, and because neural networks are good at pattern recognition, they are widely used for it.

There are four basic steps to performing recognition:

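First, digitize the speech that we want to recognize. A simple way to do this is to use a microphone, which turns the sound wave into an electrical signal, together with an analog-to-digital converter (for example, a computer's sound card), which samples that signal into a sequence of numbers.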
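As a rough illustration of this first step, the sketch below samples a synthetic tone and rounds each sample to an integer level. The 8000 Hz sampling rate, the 16-bit depth, and the function names are assumptions made for the example; a real recording would come from the sound hardware rather than from a formula.

```python
import math

SAMPLE_RATE = 8000      # assumed samples per second
BIT_DEPTH = 16          # assumed bits per sample
MAX_AMPLITUDE = 2 ** (BIT_DEPTH - 1) - 1   # 32767 for signed 16-bit samples


def analog_signal(t):
    """Stand-in for the microphone voltage at time t (seconds): a 440 Hz tone in [-1, 1]."""
    return math.sin(2 * math.pi * 440 * t)


def digitize(signal, duration_ms, sample_rate=SAMPLE_RATE):
    """Sample the signal at fixed intervals and round each value to an integer level."""
    n_samples = sample_rate * duration_ms // 1000
    return [round(signal(i / sample_rate) * MAX_AMPLITUDE) for i in range(n_samples)]


# 10 ms of audio at 8000 Hz gives 80 integer samples.
samples = digitize(analog_signal, duration_ms=10)
```

The result is simply a list of integers, which is the form the later steps work on.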

Second, compute features that represent the spectral-domain content of the speech (regions of strong energy at particular frequencies). These features are computed every 10 msec, and each 10-msec section is called a frame.
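To make the idea of frames and spectral energy concrete, here is a minimal sketch that cuts a signal into 10-msec frames and measures the energy in each frequency bin with a naive discrete Fourier transform. The sampling rate, the test tone, and the function names are assumptions for illustration. Real front ends typically use overlapping windows and more elaborate features, so this shows the principle rather than the exact features a recognizer would use.

```python
import cmath
import math

SAMPLE_RATE = 8000          # assumed sampling rate in Hz (not given in the post)
FRAME_MS = 10               # frame length from the text: 10 msec
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 80 samples per frame


def split_into_frames(samples, frame_len=FRAME_LEN):
    """Cut the signal into consecutive, non-overlapping frames; drop the remainder."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def power_spectrum(frame):
    """Naive DFT: energy in each frequency bin from 0 Hz up to the Nyquist frequency."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# Synthetic test signal: 50 ms of a pure 1000 Hz tone.
signal = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(400)]
frames = split_into_frames(signal)
features = [power_spectrum(f) for f in frames]

# Bin k corresponds to frequency k * SAMPLE_RATE / FRAME_LEN, i.e. 100 Hz per bin here.
strongest_bin = max(range(len(features[0])), key=lambda k: features[0][k])
peak_hz = strongest_bin * SAMPLE_RATE / FRAME_LEN
```

For the pure tone used here, the strongest bin in every frame sits at the tone's own frequency, which is the "region of strong energy" the features are meant to capture.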

Third, a neural network (also called an ANN; the kind used here is a multi-layer perceptron, or MLP) classifies the set of features for each frame into phonetic-based categories. Because neural networks are good at pattern recognition, this is the step that turns the digitized sound into a symbolic, text-like form. The output is a string of category labels such as “$sil$mid, $front, E>$fric, $mid$sil”, which represents the word “yes” together with its pronunciation. One important thing to note is that the network also outputs a number between 0 and 1 for each category: the probability of that category at that frame. The category with the highest probability at a given frame is not automatically taken as the answer; the probabilities are only the scores that the Viterbi search uses to find the most likely word.
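The sketch below shows what classifying one frame could look like: a tiny MLP with one hidden layer and a softmax output, so that each category receives a probability between 0 and 1. The category names, the layer sizes, the feature values, and the random (untrained) weights are all placeholders, and the softmax output layer is an assumption, since the post does not say how the probabilities are produced.

```python
import math
import random

CATEGORIES = ["sil", "y", "E", "s"]   # hypothetical phonetic categories


def softmax(scores):
    """Turn raw scores into probabilities that are each in (0, 1) and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One fully connected layer: weights[j] holds the input weights of unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def mlp_forward(features, params):
    """One hidden layer with tanh, then a softmax output layer."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(h) for h in dense(features, w1, b1)]
    return softmax(dense(hidden, w2, b2))


def random_params(n_in, n_hidden, n_out, seed=0):
    """Untrained placeholder weights; a real recognizer would learn these from data."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
    return w1, [0.0] * n_hidden, w2, [0.0] * n_out


params = random_params(n_in=3, n_hidden=5, n_out=len(CATEGORIES))
frame_features = [0.2, -1.3, 0.7]          # made-up feature vector for one frame
probs = mlp_forward(frame_features, params)
for name, p in zip(CATEGORIES, probs):
    print(f"{name}: {p:.3f}")
```

Running the network on every frame gives one such list of probabilities per frame, and that sequence of lists is what the next step consumes.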


Fourth, a Viterbi search matches the neural network's output scores against the target words in order to determine which word was most likely uttered. The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, which is called the Viterbi path. The output of the neural network will never match one word perfectly, because of noise, differences in pronunciation, and many other factors. The search therefore scores several candidate words that could match the output, each with a probability computed by the algorithm, and returns the most likely one as its result.
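The following is a simplified sketch of such a search, not the exact procedure of any particular recognizer. Each word is assumed to be a left-to-right chain of categories, and each frame either stays in the current category or moves on to the next one. The frame probabilities, the two-word vocabulary, and the function name `viterbi_align` are made up for the example, and transition probabilities are left out to keep it short.

```python
import math

# Hypothetical per-frame probabilities from the network, one dict per frame.
FRAME_PROBS = [
    {"sil": 0.80, "y": 0.10, "E": 0.05, "s": 0.03, "n": 0.01, "o": 0.01},
    {"sil": 0.10, "y": 0.60, "E": 0.10, "s": 0.05, "n": 0.10, "o": 0.05},
    {"sil": 0.05, "y": 0.10, "E": 0.60, "s": 0.05, "n": 0.05, "o": 0.15},
    {"sil": 0.05, "y": 0.05, "E": 0.50, "s": 0.30, "n": 0.05, "o": 0.05},
    {"sil": 0.05, "y": 0.02, "E": 0.08, "s": 0.80, "n": 0.03, "o": 0.02},
    {"sil": 0.85, "y": 0.02, "E": 0.03, "s": 0.05, "n": 0.03, "o": 0.02},
]

# Hypothetical vocabulary: each word is a left-to-right chain of categories.
WORDS = {
    "yes": ["sil", "y", "E", "s", "sil"],
    "no": ["sil", "n", "o", "sil"],
}


def viterbi_align(frame_probs, states):
    """Return the best log-probability and path aligning the frames to the states.

    Each frame either stays in the current state or moves to the next one;
    the path must start in the first state and end in the last.
    """
    neg_inf = float("-inf")
    n = len(states)
    # best[j] = best log-score of any path ending in state j at the current frame
    best = [neg_inf] * n
    best[0] = math.log(frame_probs[0][states[0]])
    backpointers = []
    for probs in frame_probs[1:]:
        new_best, pointers = [], []
        for j in range(n):
            stay = best[j]
            advance = best[j - 1] if j > 0 else neg_inf
            prev = j if stay >= advance else j - 1
            score = max(stay, advance)
            new_best.append(score + math.log(probs[states[j]]) if score > neg_inf else neg_inf)
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Trace back from the final state to recover the state index for every frame.
    path = [n - 1]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[-1], [states[j] for j in path]


# Score every candidate word, then pick the most likely one.
scores = {word: viterbi_align(FRAME_PROBS, states) for word, states in WORDS.items()}
best_word = max(scores, key=lambda w: scores[w][0])
```

Working with log-probabilities turns the product of per-frame probabilities into a sum, which avoids numerical underflow on long utterances. Every word in the vocabulary gets a score, and the word with the highest score is reported.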

As the explanation above shows, the neural network is the most important part of speech recognition. It receives the features computed from the digitized sound and produces a text-like representation of them, which the computer can easily work with. The results of the later steps all depend on the output of the neural network.
