Speech To Sign Language


Abstract—

Sign language is a visual language that is used by people with hearing difficulties as their natural language of communication. Unlike acoustically conveyed sound patterns that most people use, sign language uses body language and manual communication to fluidly convey the thoughts of a person. It is achieved by simultaneously combining hand shapes, orientation and movement of the hands, arms or body, and facial expressions.

Speech to Sign Language is an attempt to convert audio signals to text and subsequently to a video stream in sign language, so that the speaker may be able to communicate their thoughts without having to learn sign language themselves. In this paper, we discuss various approaches to easing communication between people with hearing impairment and those who communicate using sound.


Like acoustic languages, sign language uses different notations in different communities. The sign language used in Britain is called "British Sign Language (BSL)" and that of India is called "Indian Sign Language (ISL)."

In our approach to solving the problem of communication between an acoustic speaker and a hearing-impaired listener, we used American Sign Language (ASL) to build our dictionary, since ASL is a standardized and widely used sign language.

I. Introduction

This research paper presents the details of converting audio signals to text using Google's Speech-to-Text API. Then, using the semantics of Natural Language Processing, we break the text into smaller understandable pieces, which requires machine learning. Data sets of predefined sign language are used as input so that the software can use artificial intelligence to display the converted audio as sign language. The Google Speech-to-Text platform converts audio to text by applying powerful neural network models in an easy-to-use API. The API recognizes 120 languages and variants to support a global user base. It can process real-time streaming or prerecorded audio using Google's machine learning technology.
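
As a rough illustration, the sketch below shows how a prerecorded audio file might be transcribed with the Google Cloud Speech-to-Text Python client; the file name, sample rate, and language code are placeholder assumptions, not values prescribed by this work.

```python
# Minimal sketch: transcribe a prerecorded WAV file with Google Cloud Speech-to-Text.
# Requires the google-cloud-speech package and valid Google Cloud credentials.
from google.cloud import speech

def transcribe_file(path: str) -> str:
    client = speech.SpeechClient()

    with open(path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,   # must match the recording
        language_code="en-US",     # one of the 120+ supported languages/variants
    )

    response = client.recognize(config=config, audio=audio)
    # Each result holds alternatives; take the top transcript of each.
    return " ".join(r.alternatives[0].transcript for r in response.results)

print(transcribe_file("speech.wav"))  # "speech.wav" is a placeholder file name
```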

After obtaining the text, we use Natural Language Processing to produce small pieces of text that can easily be interpreted into sign language. In this particular case, we map them onto the grammar of American Sign Language.
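
The sketch below illustrates the general idea of simplifying English text toward a sign-friendly gloss by dropping articles and copulas; the stop list is a hypothetical example and does not capture real ASL grammar (topic-comment order, classifiers, non-manual markers).

```python
# Illustrative sketch only: crude English-to-gloss simplification.
import re

# Assumed stop list: articles and copulas that a sign gloss typically omits.
DROP = {"a", "an", "the", "is", "am", "are", "was", "were", "be"}

def english_to_gloss(sentence: str):
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t.upper() for t in tokens if t not in DROP]

print(english_to_gloss("The weather is nice today"))  # ['WEATHER', 'NICE', 'TODAY']
```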

The next step involves interpreting the grammar into sign language graphics for the final result. We search our directory for the phrases in the processed text and display them accordingly. If a phrase cannot be found, we convey the individual words of the text in sign language. On the rare occasion of failure in this scenario, we display every letter of the word.
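
A minimal sketch of this cascading lookup is shown below, assuming hypothetical dictionaries SIGN_VIDEOS and LETTER_SIGNS that map phrases, words, and letters to clip file names.

```python
# Sketch of the phrase -> word -> fingerspelling fallback described above.
# SIGN_VIDEOS and LETTER_SIGNS are hypothetical dictionaries mapping text to clip file names.
SIGN_VIDEOS = {"thank you": "thank_you.mp4", "hello": "hello.mp4", "name": "name.mp4"}
LETTER_SIGNS = {c: f"{c}.png" for c in "abcdefghijklmnopqrstuvwxyz"}

def to_sign_clips(phrase: str):
    phrase = phrase.lower().strip()
    if phrase in SIGN_VIDEOS:                      # 1) whole phrase found
        return [SIGN_VIDEOS[phrase]]
    clips = []
    for word in phrase.split():                    # 2) fall back to single words
        if word in SIGN_VIDEOS:
            clips.append(SIGN_VIDEOS[word])
        else:                                      # 3) last resort: fingerspell each letter
            clips.extend(LETTER_SIGNS[c] for c in word if c in LETTER_SIGNS)
    return clips

print(to_sign_clips("thank you"))   # ['thank_you.mp4']
print(to_sign_clips("hello bob"))   # ['hello.mp4', 'b.png', 'o.png', 'b.png']
```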

Natural Language Processing (NLP) is a powerful tool for translating human language. This work is also responsible for forming meaningful sentences from sign language symbols, which can then be read out by a hearing person. For this, a combination of sign language visualization and natural language processing techniques is used. The vital target of this project is to help deaf/mute and hearing people ease their day-to-day lives.

II. Related Work

The domains of speech recognition and sign language translation have many applications, each with its own implementation. Some of them are listed below:

A. Sign Language to Text Translator

The main objective is to translate sign language to text/speech. The framework provides a helping hand for the speech-impaired to communicate with the rest of the world using sign language, eliminating the middle person who generally acts as a medium of translation. It provides a user-friendly environment by producing speech/text output for a sign gesture input. Video of the signer is captured with a camera and then preprocessed using image processing techniques.

The preprocessing steps are listed below (a minimal pipeline sketch follows the list):

  • Framing
  • Segmentation and Tracking
  • Feature Extraction
  • Classification and Recognition
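
The sketch below outlines how such a pipeline could be wired together with OpenCV; the frame-sampling step, skin-colour thresholds, and the choice of Hu moments as features are illustrative assumptions rather than the exact method of the cited system.

```python
# Rough sketch of the listed preprocessing stages using OpenCV.
import cv2
import numpy as np

def extract_frames(video_path: str, step: int = 5):
    """Framing: sample every `step`-th frame from the video."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield frame
        idx += 1
    cap.release()

def segment_hand(frame):
    """Segmentation: crude skin-colour mask in HSV space (thresholds are assumptions)."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    return cv2.inRange(hsv, (0, 30, 60), (20, 150, 255))

def extract_features(mask):
    """Feature extraction: Hu moments of the largest contour in the mask."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return np.zeros(7)
    largest = max(contours, key=cv2.contourArea)
    return cv2.HuMoments(cv2.moments(largest)).flatten()

# Classification and recognition would feed these feature vectors to any trained
# classifier (e.g. an SVM) that maps them to sign labels.
```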

B. Object Detection for Gaming (Kinect)

This project investigates an object detection system that uses both image and three-dimensional (3D) point cloud data captured from the low-cost Microsoft Kinect vision sensor. The system works in three parts: image and point cloud data are fed into two components; the point cloud is segmented into hypothesized objects and the image regions for those objects are extracted; finally, histogram of oriented gradients (HOG) descriptors are used for detection with a sliding-window scheme. The system is evaluated by detecting backpacks on a challenging set of capture sequences in an indoor office environment, with encouraging results.
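
The following sketch shows HOG descriptors used in a sliding-window detector in the spirit of this system, assuming a grayscale input image and a pre-trained scikit-learn-style classifier; the window size and step are arbitrary choices, not the cited system's parameters.

```python
# Sketch of HOG-based detection with a sliding window.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def sliding_windows(image, win=(128, 64), step=32):
    """Yield (top-left corner, patch) pairs over a grayscale image."""
    h, w = image.shape[:2]
    for y in range(0, h - win[0] + 1, step):
        for x in range(0, w - win[1] + 1, step):
            yield (y, x), image[y:y + win[0], x:x + win[1]]

def detect(image, classifier, win=(128, 64)):
    """Return window coordinates the (assumed pre-trained) classifier labels as the object."""
    hits = []
    for (y, x), patch in sliding_windows(image, win):
        descriptor = hog(resize(patch, win), orientations=9,
                         pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        if classifier.predict([descriptor])[0] == 1:   # 1 = object class
            hits.append((y, x))
    return hits
```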

C. Speech Recognition for Artificial Intelligence

Speech recognition, or speech to text, requires recording and digitizing the acoustic patterns, converting them into basic linguistic phonemes, composing words from those phonemes, and analyzing the words in context to ensure that the spelling of each word matches its sound. One way to approach this problem is to study the possibility of developing a software architecture based on artificial neural networks, in which the architecture can distinguish between the sound signals of different users. The system is first trained with fixed weights; it then produces the matching output for each of these patterns at high speed. The neural network proposed above is based on a study of solutions to speech recognition and signal detection problems.

In various engineering and scientific fields, such as biology, psychology, medicine, marketing, computer vision, artificial intelligence, and remote sensing, the automatic recognition, description, classification, and grouping of patterns are important tasks. Fingerprint images, handwritten words, a human face, or a voice signal can all serve as patterns.

Given a pattern, recognition or classification may take one of the following two forms:

  • Supervised Classification (discriminant analysis): supervised classification relies on an analyst to define the classes that the data are classified into and to provide training data for each defined class.
  • Unsupervised Classification: unsupervised classification is where the outcomes are based on the software's analysis of a pattern without the user providing sample classes.

The problem here is to decide which approach to use: the supervised classification setting, where training data for each defined class is provided, or unsupervised classification, where the system is responsible for forming clusters to define classes and associating objects with them.
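
The contrast can be made concrete with a minimal scikit-learn sketch, assuming synthetic two-dimensional features: an SVM is trained with labels (supervised), while k-means forms its own clusters without them (unsupervised).

```python
# Sketch contrasting the two approaches on the same feature vectors.
import numpy as np
from sklearn.svm import SVC
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)        # labels available only in the supervised case

# Supervised: classes and training labels are provided by the analyst.
clf = SVC().fit(X, y)
print(clf.predict([[4.8, 5.1]]))          # predicted class label

# Unsupervised: the algorithm forms its own clusters without sample classes.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])                      # cluster assignments discovered by the algorithm
```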

Applications span a variety of fields, such as email filtering and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task; computational statistics, which focuses on making predictions using computers; and data mining, which focuses on exploratory data analysis through unsupervised learning. At the same time, the demand for automatic pattern recognition is growing due to the presence of large databases and strict requirements on speed, accuracy, and cost. The design of a pattern recognition system essentially consists of the following three aspects:

  • Data acquisition and preprocessing
  • Data representation
  • Decision making

The scope of the problem dictates the choice of preprocessing technique, representation scheme, and decision-making model. It is recognized that a clearly defined and sufficiently constrained recognition problem will lead to a compact representation and a simple decision-making strategy. Learning from a set of examples is an important and necessary attribute of most pattern recognition systems. The most prominent approaches to pattern recognition are:

  • Template matching
  • Statistical classification
  • Syntactic or structural matching
  • Neural networks

III. Speech Recognition And Conversion To Text

The goal of speech recognition is for a machine to be able to "hear," "understand," and "act upon" spoken information. The earliest speech recognition systems were attempted in the early 1950s at Bell Laboratories, where Davis, Biddulph, and Balashek developed an isolated-digit recognition system for a single speaker. The goal of automatic speaker recognition is to analyze, extract, characterize, and recognize information about the speaker's identity. The speaker recognition system may be viewed as working in four stages:

  • Speech Acquisition or Sign Acquisition
  • Speech Feature Extraction
  • Speech to Text Modelling
  • Matching Technique

A. Speech Acquisition

Speech data contain different types of information that reveal a speaker's identity. This includes speaker-specific information due to the vocal tract, the excitation source, and behavioral features. Information about behavioral features is also embedded in the signal and can be used for speaker recognition. The speech analysis stage deals with choosing a suitable frame size for segmenting the speech signal for further analysis and feature extraction. Speech analysis is carried out with the following three techniques (a short framing sketch follows the list):

  1. Segmentation Analysis: In this case speech is analyzed using a frame size and shift in the range of 10-30 ms to extract speaker information. This analysis is mainly used to extract vocal tract information for speaker recognition.
  2. Sub-segmental Analysis: Speech analyzed using a frame size and shift in the range of 3-5 ms is known as sub-segmental analysis. This technique is mainly used to analyze and extract the characteristics of the excitation source.
  3. Supra-segmental Analysis: In this case, speech is analyzed using a larger frame size; this technique is mainly used to analyze characteristics due to the behavioral traits of the speaker.
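
As a short illustration of segmentation analysis, the sketch below splits a signal into overlapping frames; the 25 ms window and 10 ms shift are common defaults assumed here, not values fixed by this work.

```python
# Minimal sketch: split a speech signal into short overlapping frames.
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    frame_len = int(sample_rate * frame_ms / 1000)
    shift_len = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift_len)
    return np.stack([signal[i * shift_len: i * shift_len + frame_len]
                     for i in range(n_frames)])

sr = 16000
audio = np.random.randn(sr)        # one second of dummy audio
frames = frame_signal(audio, sr)
print(frames.shape)                 # (98, 400): 400-sample frames every 160 samples
```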

B. Speech Feature Extraction

Speech feature extraction in a categorization problem is about reducing the dimensionality of the input vector while maintaining the discriminating power of the signal. As we know from the fundamental formulation of speaker identification and verification systems, the number of training and test vectors needed for the classification problem grows with the dimension of the input, so feature extraction of the speech signal is needed.

The different feature extraction techniques are described as follows (a brief extraction sketch follows the list):

  • Spectral features such as band energies, formants, spectrum, and cepstral coefficients carry mainly speaker-specific information due to the vocal tract.
  • Excitation source features such as pitch and variation in pitch.
  • Long-term features such as duration and energy information due to behavioral traits.
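
For instance, spectral and excitation-source features could be extracted with librosa as sketched below; the audio file name and the choice of 13 MFCCs are assumptions for illustration.

```python
# Sketch of feature extraction: MFCCs (spectral) and a pitch track (excitation source).
import librosa

y, sr = librosa.load("speech.wav", sr=16000)            # placeholder audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # shape: (13, n_frames)
pitch = librosa.yin(y, fmin=50, fmax=400, sr=sr)        # pitch estimate per frame
print(mfcc.shape, pitch.shape)
```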

C. Speech to Text Modelling

The objective of the modeling technique is to generate speaker models using speaker-specific feature vectors. Speaker modeling techniques are divided into two classifications: speaker recognition and speaker identification.

The speaker identification technique automatically identifies who is speaking on the basis of individual information integrated into the speech signal. Speaker recognition is also divided into two modes: speaker-dependent and speaker-independent. In the speaker-independent mode of speech recognition, the computer should ignore the speaker-specific characteristics of the speech signal and extract the intended message. In speaker recognition, on the other hand, the machine should extract speaker characteristics from the acoustic signal.

The main aim of speaker identification is to compare a speech signal from an unknown speaker against a database of known speakers. The system can recognize the speaker once it has been trained with a number of speakers. Speaker recognition can also be divided into two methods: text-dependent and text-independent. In the text-dependent method the speaker says key words or sentences with the same text for both training and recognition trials, whereas the text-independent method does not rely on a specific text being spoken.
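
A minimal sketch of this comparison is given below, assuming each enrolled speaker is represented by a single averaged feature vector and using cosine similarity as the matching score; real systems use richer models such as GMMs or speaker embeddings.

```python
# Sketch of speaker identification against a small enrolled database.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(utterance_features: np.ndarray, enrolled: dict) -> str:
    """utterance_features: (n_frames, dim) matrix, e.g. MFCCs; enrolled: speaker -> template vector."""
    query = utterance_features.mean(axis=0)
    return max(enrolled, key=lambda name: cosine(query, enrolled[name]))

enrolled = {"alice": np.array([1.0, 0.2, 0.1]), "bob": np.array([0.1, 0.9, 0.8])}
test = np.array([[0.9, 0.25, 0.15], [1.1, 0.18, 0.05]])
print(identify(test, enrolled))            # -> 'alice'
```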

D. Matching Technique

Speech-recognition engines match a detected word to a known word using one of the following techniques (a whole-word matching sketch follows the list).

  1. Whole-word matching: The engine compares the incoming digital-audio signal against a prerecorded template of the word. This technique takes much less processing than sub-word matching, but it requires that the user (or someone) prerecord every word that will be recognized – sometimes several hundred thousand words. Whole-word templates also require large amounts of storage (between 50 and 512 bytes per word) and are practical only if the recognition vocabulary is known when the application is developed.
  2. Sub-word Matching: The engine looks for sub-words – usually phonemes – and then performs further pattern recognition on those. This technique takes more processing than whole-word matching, but it requires much less storage (between 5 and 20 bytes per word). In addition, the pronunciation of a word can be guessed from English text without requiring the user to speak the word beforehand.
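
As an illustration of whole-word matching, the sketch below compares an incoming feature sequence against prerecorded word templates with dynamic time warping; the templates and feature representation are assumptions, and production engines typically use statistical models instead.

```python
# Sketch of whole-word template matching via dynamic time warping (DTW).
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a, b: (n_frames, dim) feature sequences; returns accumulated alignment cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def recognize(features: np.ndarray, templates: dict) -> str:
    """Return the template word whose prerecorded features align best with the input."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))
```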

IV. Text To Sign Language Synthesis

Many machine translation systems for spoken languages are available, but translation systems between spoken and sign languages are limited. Translation from text to sign language differs from translation between spoken languages because sign language is a visual-spatial language that uses hands, arms, face, head, and body postures for communication in three dimensions. Translation from text to sign language is complex because the grammar rules for sign language are not standardized. Still, a number of approaches are under research for translating text to sign language, in which the input is text and the output takes the form of pre-recorded videos or an animated character (avatar) generated by computer.

There are no accurate measurements of how many people use American Sign Language (ASL); estimates vary from 500,000 to 15 million people. However, 28 million Americans (about 10% of the population) have some degree of hearing loss, and 2 million of these 28 million are classified as deaf. For many of these people, their first language is ASL. The ASL alphabet is 'fingerspelled', meaning all 26 letters from A to Z can be spelled using one hand. There are three main use cases of fingerspelling in any sign language:

  • Spelling your name
  • Emphasizing a point (i.e. literally spelling out a word)
  • When saying a word not present in the ASL dictionary (the current Oxford English dictionary has 170,000 words, while estimates for ASL range from 10,000 to 50,000 words)

V. Automatic Speech Recognition

ASR is the process of converting a speech signal into a text message or word sequence; it is also called a speech-to-text system. Speech is an essential and vital means of conversation among people, and is basically the easiest way for humans to share information. In an ordinary speech communication system, speech is transmitted in its original form without knowledge of its properties. ASR is required to compress the input speech into a small set of data that can be correctly classified as phonemes, and it involves building words one by one, sequentially, so that they best match the given input speech waveform. It is complicated to convert speech into a word sequence without compressing the input data. The average rate of uttered sounds is approximately 12 per second.

Many applications of speaker recognition exist today, such as data entry, speech-to-text, voice dialing, accessing database services, telephone banking, telephone shopping by speaker dialing, information services, and forensic purposes. The goal of speech recognition is to recognize the voice in spoken words, and also to analyze the speaker by extracting features and modeling the information contained in the input voice signal. The accuracy of an ASR system is governed by many parameters, such as speaker dependence or independence, isolated versus continuous word detection, the size and discriminability of the vocabulary or trained data in the dictionary, the environment (nature of the noise, signal-to-noise ratio, working conditions), the transducer (band amplitude, microphone or telephone, distortion or repetition in the channel conditions), as well as the age, gender, and physical state of the speaker, the speech style (normal, quiet, or shouted tone of voice), and different pronunciations of each word. After the complete implementation of SignSpeech, and with further improvements in efficiency, a comparison with the above ASR systems will be made to gauge the success of our web app, which is currently a pilot study.

VI. Vocabulary Building

During each phase of speech/voice recognition training, the words you speak become part of a basic vocabulary stored in your speech/voice recognition files. The program relies on this vocabulary to recognize and translate your speech efficiently and accurately. In real life, we seldom restrict our speech to the basic vocabulary alone. Names, places, and unique terminology are essential to conveying our messages. It is very likely that some of the terminology speakers use will exceed the basic vocabulary assembled by the program during training. When the program attempts to recognize these unfamiliar words, its translation falls back on guesswork. Mistranslations may also occur if the spoken word or phrase sounds very similar to the word or phrase that the program translated.

The Vocabulary Builder analyzes the contents of a document file, performs tokenization, and identifies words not included in the program's lexicon. Tokenization is the process of breaking the given text up into units called tokens. The tokens may be words, numbers, or punctuation marks. Tokenization does this by locating word boundaries: the ending point of one word and the beginning of the next. Tokenization is also known as word segmentation. The Vocabulary Builder then invites you to select and train unfamiliar words, or minimal symbols and vowels/consonants, so that the speech recognition engine will recognize the words when you speak them.
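
A minimal sketch of this step is shown below, assuming a simple regular-expression tokenizer and a toy lexicon; a real vocabulary builder would use the program's full dictionary.

```python
# Sketch: tokenize a document and collect words missing from the program's lexicon.
import re

def tokenize(text: str):
    # Word boundaries are located with a simple pattern: runs of letters/apostrophes.
    return re.findall(r"[A-Za-z']+", text.lower())

def unknown_words(text: str, lexicon: set) -> set:
    return {tok for tok in tokenize(text) if tok not in lexicon}

lexicon = {"the", "patient", "was", "given", "a", "dose", "of"}
print(unknown_words("The patient was given a dose of amoxicillin.", lexicon))
# -> {'amoxicillin'}
```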

The Vocabulary Builder can analyze text in list form or as a normal text passage. Words saved in list format (one word or phrase per line) are added to the vocabulary in a batch. You can also analyze documents, such as a technical article, chapter summary, or glossary of terms, and add the unknown words at your discretion.

After you enter new vocabulary words and train the program to recognize the words, errors sometimes occur. The speech recognition engine lets you correct such errors using a simple voice-activated command known as Correct That. Building and refining your vocabulary and training the program to recognize new words will improve your accuracy and the effectiveness of the program as a communication access tool.

VII. Conclusion And Future Work

This paper is about a system that can support communication between deaf and hearing people. The aim of the study is to enable a complete dialog without knowledge of sign language. The program has two parts. First, the voice recognition part uses speech processing methods: it takes the acoustic voice signal and converts it to digital text on the computer. Second, the text is converted into recognizable sign hand movements for deaf people.

The project highlights the many areas in which sign language translation can be used. With this system, there is an opportunity to deploy it in places such as schools, doctors' offices, colleges, universities, airports, social service agencies, community service agencies, and courts; in short, almost everywhere.

The system is an important demonstration of how communication technology can help sign language users communicate with others. Sign languages can be used wherever they are needed, and such a system could reach many local communities. Future work includes developing a mobile application of this system that enables everyone to speak with deaf people.

A project of such caliber has extensive applications given the technologies used to achieve its goal. Some of them include the following:

  • We can further extend this project to recognize a person's sign movements and convert them into text, and furthermore speech, so that it seems as if the person were speaking themselves. Since mute people are usually deprived of normal communication with other people, they have to rely on an interpreter or some visual communication. An interpreter may not always be available, so this project can help eliminate the dependency on the interpreter.
  • The system can be extended to incorporate knowledge of facial expressions and body language too, so that there is a complete understanding of the context and tone of the input speech. This includes understanding the emotions and expressions of the person so as to predict the sentiment of the speech, which can be used to better portray what is being communicated.
  • Mobility and portability can be included in this project to increase the reach of this technology to as many people as possible. A mobile and web-based version of the application will increase the reach to more people.
  • Integrating a hand gesture recognition system using computer vision to establish a two-way communication system.

References

  1. Himali Junghare and Prashant Borkar, "Efficient Methods and Implementation of Automatic Speech Recognition System," Department of Computer Science, G.H. Raisoni College of Engineering, Nagpur University, Nagpur, India.
  2. Santosh K. Gaikwad, Bharti W. Gawali, and Pravin Yannawar, "A Review on Speech Recognition Technique," Department of CS & IT, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad.
  3. GeeksforGeeks, "Project Idea: Audio to Sign Language Translator," https://www.geeksforgeeks.org/project-idea-audio-sign-language-translator/
  4. Handspeak, http://www.handspeak.com/

