Friday, March 13, 2009

Neural Networks in Speech Recognition

Speech is a natural mode of communication for people, but we rarely realize how complex a phenomenon it is. Speech varies widely in accent, pronunciation, articulation, volume, and speed; moreover, during transmission, these irregular speech patterns can be further distorted by background noise and echoes, as well as by the electrical characteristics of the channel. All of this makes speech recognition a very complex problem. Speech recognition is basically a pattern recognition problem, and because neural networks are good at pattern recognition, they are widely used in speech recognition.

There are four basic steps to performing recognition:

First, digitize the speech that we want to recognize. A simple way to do this is to use a microphone: it turns the sound wave into an electrical signal, which the computer's sound card then samples into a digital waveform.

Second, compute features that represent the spectral-domain content of the speech (regions of strong energy at particular frequencies). These features are computed every 10 msec, with one 10-msec section called a frame.
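To make the idea of a frame concrete, here is a minimal sketch in Python. It is only my illustration, not how a real recognizer does it: the function names, the 8000 Hz sample rate, the four equal-width frequency bands, and the synthetic 1000 Hz test tone are all assumptions, and real systems usually use overlapping windows and more refined spectral features.

```python
import cmath
import math

def split_into_frames(samples, sample_rate, frame_ms=10):
    """Cut the signal into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = sample_rate * frame_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def band_energies(frame, num_bands=4):
    """Energy of one frame in num_bands equal-width frequency bands (naive DFT)."""
    n = len(frame)
    half = n // 2  # only the bins below half the sample rate carry new information
    power = []
    for k in range(half):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power.append(abs(coeff) ** 2)
    band_width = half // num_bands
    return [sum(power[b * band_width:(b + 1) * band_width])
            for b in range(num_bands)]

# Synthetic input (an assumption): 50 ms of a 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(400)]

frames = split_into_frames(signal, sample_rate)   # five frames of 80 samples
features = [band_energies(f) for f in frames]     # one feature vector per frame

for i, vec in enumerate(features):
    strongest = vec.index(max(vec))
    print("frame", i, "strongest band:", strongest)
```

With these settings each band covers 1000 Hz, so the tone shows up as strong energy in the second band (index 1) of every frame. That is the kind of "region of strong energy at a particular frequency" the features are meant to capture.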

Third, a neural network (also called an ANN, multi-layer perceptron, or MLP) is used to classify a set of these features into phonetic-based categories at each frame. Because neural networks are good at pattern recognition, the network can map the digital form of the sound wave to a text form. The output is a character string such as “$sil$mid, $front, E>$fric, $mid$sil”, which represents the word “yes” together with its pronunciation. One important thing to note is that the network also outputs a number between 0 and 1 for each category, which is the probability of that category. The category with the highest probability in a frame is not automatically taken as the answer; the probabilities are only scores that the Viterbi search uses to find the most likely word.
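Here is a small sketch of how raw network outputs can become numbers between 0 and 1. I am assuming a softmax output layer, which is a common choice but not something I have confirmed for any particular recognizer. The category labels and the raw scores below are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw output scores into probabilities that lie in (0, 1) and sum to 1."""
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical, simplified category labels and made-up raw scores for three frames.
categories = ["sil", "y", "E", "s"]
raw_scores = [
    [3.0, 0.5, 0.2, 0.1],   # mostly silence
    [0.3, 2.0, 1.8, 0.2],   # "y" and "E" are almost tied
    [0.1, 0.2, 0.4, 2.5],   # mostly "s"
]

frame_probs = [softmax(row) for row in raw_scores]
for probs in frame_probs:
    print({c: round(p, 2) for c, p in zip(categories, probs)})
```

In the second frame "y" and "E" get almost the same probability. Picking the winner frame by frame would be fragile there, which is why the probabilities are passed on as scores instead of being turned into a final decision right away.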


Fourth, a Viterbi search is used to match the neural network output scores to the target words, in order to determine the word that was most likely uttered. The Viterbi search is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path. The output of the neural network will never match one word 100%, because of noise, differences in pronunciation, and many other reasons. The Viterbi search therefore scores more than one candidate word against the output, computes a probability for each, and returns the most likely one as the result of the search.
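The search can be sketched in a few lines of Python. This is a simplified version under my own assumptions: each word is a fixed left-to-right sequence of categories, every frame either stays in the current category or moves on to the next one, all transitions are treated as equally likely, and the two-word vocabulary and the frame probabilities are invented for the example.

```python
import math

def viterbi_word_score(frame_probs, word_states):
    """Best log-probability of aligning all frames to the word's states, in order.

    frame_probs: one dict per frame, mapping category -> probability (> 0).
    word_states: the word's category sequence; a path starts in the first
    state, ends in the last one, and at each frame stays or advances by one.
    """
    neg_inf = float("-inf")
    # best[j] = best log score of any path that is in state j at the current frame
    best = [neg_inf] * len(word_states)
    best[0] = math.log(frame_probs[0][word_states[0]])
    for probs in frame_probs[1:]:
        new_best = [neg_inf] * len(word_states)
        for j, state in enumerate(word_states):
            stay = best[j]
            advance = best[j - 1] if j > 0 else neg_inf
            prev = max(stay, advance)
            if prev > neg_inf:
                new_best[j] = prev + math.log(probs[state])
        best = new_best
    return best[-1]

# Hypothetical vocabulary: each word as a sequence of simplified categories.
words = {
    "yes": ["sil", "y", "E", "s", "sil"],
    "no": ["sil", "n", "o", "sil"],
}

# Made-up per-frame probabilities, standing in for the neural network output.
frame_probs = [
    {"sil": 0.80, "y": 0.05, "E": 0.05, "s": 0.03, "n": 0.05, "o": 0.02},
    {"sil": 0.10, "y": 0.50, "E": 0.05, "s": 0.03, "n": 0.30, "o": 0.02},
    {"sil": 0.02, "y": 0.10, "E": 0.60, "s": 0.03, "n": 0.05, "o": 0.20},
    {"sil": 0.05, "y": 0.05, "E": 0.50, "s": 0.05, "n": 0.05, "o": 0.30},
    {"sil": 0.10, "y": 0.03, "E": 0.10, "s": 0.70, "n": 0.02, "o": 0.05},
    {"sil": 0.40, "y": 0.05, "E": 0.05, "s": 0.40, "n": 0.05, "o": 0.05},
    {"sil": 0.80, "y": 0.02, "E": 0.03, "s": 0.10, "n": 0.02, "o": 0.03},
]

scores = {word: viterbi_word_score(frame_probs, states)
          for word, states in words.items()}
best_word = max(scores, key=scores.get)
print(scores)
print("most likely word:", best_word)
```

Neither word matches the frames perfectly, but "yes" gets a much better score than "no", so it is returned as the result. Working with logarithms keeps the product of many small probabilities from underflowing.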

As the explanation above shows, the neural network is the most important part of speech recognition. It receives the digital form of the sound wave and produces a text form, which the computer can easily process, to represent it. The later steps build on the output of the neural network.

Thursday, March 12, 2009

An Interesting Speech Recognition Tool

There is an interesting tool in Office Word 2003 that few people have found: a powerful speech recognition tool. We can find it in the Tools menu of Word 2003. When it is opened for the first time, you will be asked to spend 15 minutes learning how to use it. Do try it, because there are quite a lot of new functions: it can type out the words you say, and it can also be used to control the computer. Another reason is that this session trains the speech recognizer; the more it is trained, the better it understands what you are saying.

It is more than a tool that types out the words you speak. We can give an order to the computer, and it will carry out the operation. For example, we can ask the computer to open a file on the desktop, or give a command to play the slides. And of course, if it is well trained, you can write an email just by speaking.

The only problem is that it cannot tell which words are a command and which words it only needs to type out. So sometimes when we are saying something, the computer does strange things. I have been thinking about this problem, and maybe a better way to solve it is to say the computer's name before the command. For example, if my computer is called laptop, I would need to say: "laptop, open the file XXX".
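Here is a rough sketch of that idea in Python. It is just my own illustration and has nothing to do with how Word 2003 works inside. The function name `interpret` and the default name "laptop" are made up, and the input is assumed to be text that the recognizer has already produced.

```python
def interpret(utterance, computer_name="laptop"):
    """Decide whether recognized text is a command or plain dictation.

    Text counts as a command only if its first word is the computer's name
    (ignoring case and a trailing comma, period, or colon) and more words follow.
    """
    text = utterance.strip()
    first, _, rest = text.partition(" ")
    if first.rstrip(",.:").lower() == computer_name.lower() and rest.strip():
        return ("command", rest.strip())
    return ("dictation", text)

print(interpret("laptop, open the file XXX"))
print(interpret("I bought a new laptop yesterday"))
```

The first call is treated as a command and the second as dictation, because the name only counts when it is the first word. A sentence that really begins with the word "laptop" would still be misread, so the name should be something we rarely say.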

Please try this tool, and if you find anything else, please tell me :)

Neural Networks

I was asked about neural networks. That is a good question for me. I am learning about neural networks (applications, algorithms, and so on), but I have never thought about how to explain what a neural network is.


Neural networks, as used in artificial intelligence, have traditionally been viewed as simplified models of neural processing in the brain. Our brain is made up of a vast network of computing elements called neurons. The brain organizes this huge number of neurons, each with weak computing power, into a massively parallel, complex network in which the neurons interact with each other dynamically to produce a powerful information processor.

The neural network that we are talking about is an interconnected group of artificial neurons that uses a mathematical or computational model for information processing. In most cases a neural network is an adaptive system that changes its structure based on external or internal information that flows through the network.

Neurons behave as functions: they transduce an unbounded input activation at a given time into a bounded output signal. The figure below shows the input and output of a neural network.
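A single artificial neuron can be written as a small Python function. This is a minimal sketch that assumes a weighted sum followed by a sigmoid (logistic) squashing function, which is one common choice among several. The weights and inputs are made up.

```python
import math

def neuron(inputs, weights, bias):
    """Map an unbounded weighted sum of the inputs to a bounded output in [0, 1]."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Sigmoid written with tanh so that very large activations do not overflow.
    return 0.5 * (1.0 + math.tanh(activation / 2.0))

weights = [1.0, 1.0]   # made-up weights
for inputs in ([0.0, 0.0], [2.0, 1.0], [-1000.0, -1000.0], [1000.0, 1000.0]):
    print(inputs, "->", neuron(inputs, weights, bias=0.0))
```

However large or small the inputs get, the output stays between 0 and 1. A network is built by feeding the outputs of many such neurons into other neurons.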


A neural network is used for pattern recognition here. I have been thinking of a simple way to explain it to the audience. For example, suppose there are two apples: one is big and red, and the other is small and green. The input for the first one is red, big, round, and quite a lot of other information, and the output of the neural network tells us it is an apple. In the same way, for the second one we input small, green, and so on, and the output still tells us it is an apple. So what a neural network does is receive all the information about something and find out what it is. I hope this makes sense.

General Introduction of Speech Recognition

My topic is neural networks in speech recognition. Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as command and control, data entry, and document preparation. Neural networks are used for many different tasks in several domains, and they have proved to be very efficient at learning complex input-output mappings.







A general speech recognition model is shown below:





The block diagram of the model includes:

1, a signal processing module for obtaining a representation of the speech signal;

2, a feature extraction module for identifying the key components of this representation and eliminating redundant information;

3, time alignment and pattern matching algorithms for performing word detection;

4, language processing for selecting a final word string.

A neural network is used in the 3rd step. That's just a brief introduction; I will give more details in the next posts.

Wednesday, March 11, 2009

The Presentation

I gave the presentation this Monday. I don't think it was a successful one. So I have been thinking about the mistakes I made, and this may help you if you have a presentation to give.



First, use whatever facilities are available (slides, a blackboard, or anything you want to show). This is very convenient, and it ensures that the presentation has a clear structure and gives your listeners something to take away. By the way, we don't need too much detail on the slides; the audience will have trouble reading the material if there are many words. An outline that includes the key words is enough. Do not use facilities that you are not sure will work well. That is what I did in the demo section, where the laptop did not do anything I expected.


Second, you need to be very clear about how much time you have, and stick to that time when preparing your presentation. It's very difficult to 'cut' a PowerPoint presentation at the event itself, so it's a great mistake to run out of time. During the presentation we speak faster than normal. Before the presentation I practised many times, and it took almost 10 minutes each time. But in the actual presentation on Monday I used 7 minutes. Even though I skipped the demo section, which should have taken 1 minute, I still finished 2 minutes early.



Another thing is to make appropriate use of pictures. It's a good idea to break up text with illustrations, and it is true that a picture is worth a thousand words. I found this quite useful in the question section: explaining an answer with a picture is easier than showing text.



The last thing is to practise in front of people. I did all my practice alone, and in the presentation I found that it is very different when there is an audience in front of you. So do practise with your friends, and you will feel better during the presentation.



I hope this is helpful, and if you have a presentation, good luck ^^.

Sunday, March 8, 2009

Nihao, Everyone

Hi, everyone

This is the start of my blog. The blog is set up for the ACT (Advanced Technical Communications) course. I was supposed to write something in this blog earlier, but it was hard to decide what to write, partly because of the language problem. I have seen quite a lot of blogs before, but this is the first time I have written something myself. All things are difficult before they are easy; that's my feeling at the moment. But anyway, I'm trying to do better.

I'd like to give an introduction to this course here. In this course we are learning the skills of communicating technical information in a wide variety of media and to diverse audiences. We learned how to create posters, give oral presentations, write effective titles and abstracts for an academic audience, write for a non-academic audience, and do peer reviewing. All these things are not as easy as I thought. We need to pay attention to all the details, and let everyone in the audience, whether in our research area or not, understand what we want to tell them. It is a nice course: we take a cup of tea, have some biscuits, and sit around talking with each other. It's more like friends sitting together; we feel relaxed talking with others, and ideas come out easily.

My topic for this course is neural networks in speech recognition. That's not my research area, but I find it interesting. I used the speech recognition tool in Word 2003, and I wonder how a computer understands your speech and follows an order (like opening a file). That's the reason I chose this topic, and I will try to find out how it works and explain it to you.