Enhancing Real-Time Speech Recognition with a Hybrid System Using Adam Optimization, CNNs and SVM on GPU
DOI: https://doi.org/10.17762/msea.v71i4.2265

Abstract
ASR systems can be used for a wide range of applications, including virtual assistants, voice search, dictation, and voice-controlled devices. They can also be integrated with other technologies, such as natural language processing (NLP) and machine learning, to provide even more advanced functionality, such as sentiment analysis and personalized recommendations. However, ASR technology is not without its challenges, such as dealing with variations in accents, background noise, and speech disorders. Nonetheless, ongoing research and development in this field are expected to lead to further improvements in ASR technology and its applications.
Real-time speech recognition is an important technology that allows machines to transcribe spoken words into written text as they are spoken. In recent years, hybrid systems that combine multiple approaches, such as deep neural networks (DNNs) and support vector machines (SVMs), have shown promising results in improving the accuracy of speech recognition. In this approach, a convolutional neural network (CNN) extracts features from the speech signal, which are then fed into an SVM for classification.
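As a rough illustration of this CNN-to-SVM pipeline, the sketch below uses a bank of random 1-D convolution kernels as a stand-in for a learned CNN feature extractor and feeds the pooled responses to an SVM. The toy sinusoid "commands", kernel sizes, and class frequencies are illustrative assumptions only, not the architecture or data used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def conv1d_features(signal, kernels):
    """Apply a bank of 1-D convolution kernels, then ReLU and
    global-average-pool each response (a frozen-CNN stand-in)."""
    feats = []
    for k in kernels:
        resp = np.convolve(signal, k, mode="valid")
        feats.append(np.maximum(resp, 0.0).mean())  # ReLU + pooling
    return np.array(feats)

# Hypothetical toy data: two "command" classes as noisy sinusoids
kernels = [rng.standard_normal(16) for _ in range(8)]
X, y = [], []
for label, freq in enumerate([5.0, 12.0]):
    for _ in range(40):
        t = np.linspace(0.0, 1.0, 400)
        sig = np.sin(2 * np.pi * freq * t) + 0.3 * rng.standard_normal(400)
        X.append(conv1d_features(sig, kernels))
        y.append(label)
X, y = np.array(X), np.array(y)

# Shuffled train/test split, then SVM classification of the CNN features
idx = rng.permutation(len(X))
train, test = idx[:60], idx[60:]
clf = SVC(kernel="rbf").fit(X[train], y[train])
acc = clf.score(X[test], y[test])
```

In a full system the convolution kernels would of course be learned jointly with the rest of the network rather than drawn at random; the point here is only the division of labor, with the CNN producing fixed-length feature vectors and the SVM drawing the decision boundary.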
To further enhance the performance of the system, the Adam optimization algorithm is employed to train the hybrid system. Adam is a stochastic gradient descent (SGD) variant that has been shown to perform well when optimizing deep neural networks. To accelerate processing, a GPU is utilized for parallel computation, which enables the system to perform speech recognition in real time.
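The Adam update rule itself is compact enough to state in a few lines. The sketch below applies it to a one-dimensional quadratic purely to show the mechanics (biased first/second moment estimates, bias correction, and the adaptive step); the learning rate and step count are arbitrary choices for the demo, not values from the paper.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and squared gradient (v), bias-corrected, then an adaptive step."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = (w - 3)^2; gradient is 2(w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t)
```

After 500 steps `w` settles close to the minimizer at 3. In practice one would use a framework implementation (e.g. a deep-learning library's Adam optimizer) over the CNN's parameters rather than hand-rolling the update.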
Overall, this hybrid system using Adam optimization, CNNs and SVM on GPU shows promise in achieving high accuracy and real-time performance in speech recognition. The previous system, built with Google's TensorFlow and AIY teams, was trained on 105,000 WAVE audio files for 11 labels; its five-layer model achieved an accuracy of 94.9% with a short training time of 4.5116 s on GPU. Beyond this, our system also handles real-time user commands such as ON, OFF, LogOff, Shutdown, Open, and Close, making it possible for blind or otherwise handicapped users to operate a computer system fluently. The previous deep-neural-network classification system was applied to 65,000 WAVE files from Google's TensorFlow and AIY commands dataset; the standard TIMIT speech dataset can also be used together with real-time user data. For feature extraction, the hybrid system computes Mel spectrogram and LPC features from the input speech, and the Adam optimization algorithm trains the convolutional neural network (CNN) and support vector machine (SVM). The combined CNN and SVM system proves to outperform other models, achieving an accuracy of 98.2% for 6 labels of data as well as accurate recognition of real-time data.
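The two feature families named above can be sketched in NumPy. The snippet below computes a simplified single-frame Mel power spectrum (triangular filters on an FFT magnitude spectrum) and LPC coefficients via the autocorrelation (Yule-Walker) method. The sample rate, FFT size, filter count, and LPC order are illustrative assumptions; the paper's actual frame-level front end is not specified here.

```python
import numpy as np

def lpc(signal, order):
    """LPC coefficients via the autocorrelation (Yule-Walker) method."""
    n = len(signal)
    r = np.correlate(signal, signal, mode="full")[n - 1:n + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def mel_spectrogram(signal, sr=8000, n_fft=256, n_mels=10):
    """Single-frame Mel power spectrum with triangular filters (simplified)."""
    spec = np.abs(np.fft.rfft(signal[:n_fft], n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_pts = np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, c, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        # Triangular filter rising lo->c and falling c->hi
        fb[i] = np.clip(np.minimum((freqs - lo) / (c - lo),
                                   (hi - freqs) / (hi - c)), 0.0, None)
    return fb @ spec

# Toy input: a noisy 440 Hz tone (noise keeps the LPC system well-conditioned)
rng = np.random.default_rng(1)
t = np.arange(400) / 8000.0
sig = np.sin(2 * np.pi * 440.0 * t) + 0.05 * rng.standard_normal(400)

mel_feats = mel_spectrogram(sig)          # 10 Mel-band energies
lpc_feats = lpc(sig, order=8)             # 8 LPC coefficients
features = np.concatenate([np.log(mel_feats + 1e-10), lpc_feats])
```

Concatenating the two feature sets, as in the last line, is one plausible way to realize the "hybrid" Mel-spectrogram-plus-LPC front end before the CNN/SVM stage.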