There are four basic steps to performing recognition:
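First, digitize the speech that we want to recognize. A simple way to do this is to record it with a microphone: the microphone converts the sound wave into an electrical signal, and the computer's sound hardware samples that signal to produce a digital waveform.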
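To make the digitizing step concrete, here is a minimal sketch of sampling and quantization. It is an illustration rather than a recording routine: the function name digitize, the 8 kHz sample rate, the 16-bit depth, and the 440 Hz test tone standing in for the microphone signal are all assumptions chosen for the example.

```python
import math

def digitize(signal, duration_s, sample_rate=8000, bits=16):
    """Sample a continuous signal and quantize each sample to a signed integer.

    signal: function mapping time in seconds to an amplitude in [-1.0, 1.0]
    (a stand-in for the analog signal coming from the microphone).
    """
    max_level = 2 ** (bits - 1) - 1              # 32767 for 16-bit samples
    n_samples = round(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                      # time of the n-th sample
        amplitude = max(-1.0, min(1.0, signal(t)))   # clip to the valid range
        samples.append(round(amplitude * max_level))  # quantize to an integer
    return samples

# Assumed test input: a 440 Hz tone lasting 10 msec.
tone = lambda t: math.sin(2 * math.pi * 440 * t)
pcm = digitize(tone, duration_s=0.01)
```

At the assumed rate of 8000 samples per second, 10 msec of sound becomes a list of 80 integers, which is the form the later steps work on.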
Second, compute features that represent the spectral-domain content of the speech (regions of strong energy at particular frequencies). These features are computed every 10 msec, with one 10-msec section called a frame.
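The framing and feature computation can be sketched as follows. This is a simplified illustration, not the feature set of any particular recognizer: the function names, the 8 kHz sample rate, the Hamming window, and the choice of four equal-width frequency bands are assumptions, and real systems usually use more elaborate features and overlapping analysis windows.

```python
import cmath
import math

def split_into_frames(samples, sample_rate=8000, frame_ms=10):
    """Cut the waveform into consecutive, non-overlapping 10-msec frames."""
    frame_len = sample_rate * frame_ms // 1000   # 80 samples at 8 kHz
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def band_energies(frame, n_bands=4):
    """Return the energy of one frame in n_bands equal-width frequency bands."""
    n = len(frame)
    # A Hamming window reduces leakage between neighbouring frequencies.
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    half = n // 2                                # bins from 0 Hz up to Nyquist
    power = []
    for k in range(half):
        # Plain discrete Fourier transform of the frame at frequency bin k.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        power.append(abs(coeff) ** 2)
    band_size = half // n_bands
    return [sum(power[b * band_size:(b + 1) * band_size])
            for b in range(n_bands)]

# Assumed test input: 30 msec of a 500 Hz tone sampled at 8 kHz.
samples = [math.sin(2 * math.pi * 500 * n / 8000) for n in range(240)]
features = [band_energies(frame) for frame in split_into_frames(samples)]
```

Each frame yields one short feature vector. For the 500 Hz tone, almost all of the energy lands in the lowest band (0 to 1000 Hz), which is the kind of "strong energy at a particular frequency" that the features are meant to capture.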
Third, a neural network (also called an ANN, multi-layer perceptron, or MLP) is used to classify a set of these features into phonetic-based categories at each frame. Because neural networks are good at pattern recognition, the network can map the features of the digitized sound to a text representation. The outputs are character strings like “$sil
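As a rough sketch of frame-by-frame classification, the following shows the forward pass of a small multi-layer perceptron that turns one feature vector into a score for each category. Everything specific here is hypothetical: the category labels, the layer sizes, and the hand-picked weights are placeholders for illustration, whereas a real network learns its weights from labeled speech.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)                           # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(features, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer with tanh units, followed by a softmax output layer."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    scores = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    return softmax(scores)

# Hypothetical categories and hand-picked weights, for illustration only.
CATEGORIES = ["ah", "s", "sil"]
W_HIDDEN = [[2.0, 0.0], [0.0, 2.0]]
B_HIDDEN = [0.0, 0.0]
W_OUT = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
B_OUT = [0.0, 0.0, 1.0]

def classify_frame(features):
    """Return the most likely category and the full probability list."""
    probs = mlp_forward(features, W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)
    return CATEGORIES[probs.index(max(probs))], probs

label, probs = classify_frame([1.0, 0.0])
```

The network is applied once per frame, so an utterance of N frames produces N probability lists. The search in the next step works on these scores rather than on a single hard decision per frame.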
Fourth, a Viterbi search is used to match the neural network output scores to the target words, in order to determine the word that was most likely uttered. Viterbi search is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path. The output of the neural network will never match a single word exactly, because of noise, differences in pronunciation, and many other reasons. The search therefore scores several candidate words against the output, computing a probability for each, and returns the most likely one as its result.
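The search step can be sketched as a small Viterbi alignment. This is a simplified illustration under stated assumptions: each word is modeled as a left-to-right sequence of categories, a frame may either stay in the current category or advance to the next one, and all transitions are treated as equally likely. The vocabulary, the category labels, and the per-frame probabilities below are made up for the example.

```python
import math

def viterbi_word_score(frame_probs, word_states, floor=1e-10):
    """Best log-probability of aligning all frames to the word's states in order."""
    neg_inf = float("-inf")
    n_states = len(word_states)

    def log_prob(probs, label):
        # The floor avoids log(0) for categories the network ruled out.
        return math.log(max(probs.get(label, 0.0), floor))

    # best[s] is the best score of any path that ends in state s at this frame.
    best = [neg_inf] * n_states
    best[0] = log_prob(frame_probs[0], word_states[0])
    for probs in frame_probs[1:]:
        new_best = [neg_inf] * n_states
        for s, label in enumerate(word_states):
            stay = best[s]
            advance = best[s - 1] if s > 0 else neg_inf
            prev = max(stay, advance)            # keep only the better predecessor
            if prev > neg_inf:
                new_best[s] = prev + log_prob(probs, label)
        best = new_best
    return best[-1]                              # the path must end in the last state

def recognize(frame_probs, vocabulary):
    """Score every target word and return the most likely one with all scores."""
    scores = {word: viterbi_word_score(frame_probs, states)
              for word, states in vocabulary.items()}
    return max(scores, key=scores.get), scores

# Hypothetical vocabulary and network outputs for four frames.
VOCABULARY = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}
FRAME_PROBS = [
    {"y": 0.70, "eh": 0.10, "s": 0.05, "n": 0.10, "ow": 0.05},
    {"y": 0.10, "eh": 0.60, "s": 0.05, "n": 0.05, "ow": 0.20},
    {"y": 0.05, "eh": 0.50, "s": 0.15, "n": 0.05, "ow": 0.25},
    {"y": 0.05, "eh": 0.05, "s": 0.80, "n": 0.05, "ow": 0.05},
]
word, scores = recognize(FRAME_PROBS, VOCABULARY)
```

With these made-up scores the best alignment for "yes" is y, eh, eh, s, and its probability is far higher than that of any alignment for "no", so "yes" is returned. Working with log-probabilities keeps the products from underflowing on long utterances.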
As the explanation above shows, the neural network is the most important part of speech recognition. It receives the features computed from the digitized sound and produces a text form, which the computer can easily process, to represent them. The search step that follows is based on the output of the neural network.