Rutgers University, Electrical & Computer Engineering

Participants:

Anthony Locorriere
Mark Acquaye

Advisor:

Prof. Dario Pompili

Goal
Our project explores the limitations of machine learning using support vector machine classification. By extracting biometric voice features, we can train a system to identify a particular speaker, recognize key phrases, and even detect whether the speaker is in a state of distress.

Contribution
Our system is an extension of existing technologies. Although speaker recognition systems have existed for decades, such systems typically disregard changes in a subject’s voice related to distress or mood. Here we attempt to develop a reliable classifier that can not only determine a speaker’s identity and a key phrase but also assess the speaker’s state of wellbeing. Such a system could provide a number of benefits in industrial settings.

Potential Applications
Many large corporations rely on the efficient management of employees to ensure quality performance and maintain safety. A speaker recognition system would offer such organizations several benefits: cross-verification of employee hours in tandem with the existing login system, personnel and visitor tracking, confirmation that routine tasks have been performed, and a safer working environment. By training the system with key phrases that the operator must utter at a given checkpoint, the system would confirm that the worker is following through with their tasks, thereby providing a form of positive behavioral adjustment. In addition, key phrases and distress detection can be trained into the system for assistance and emergency purposes. If an employee working alone were severely injured, the system could respond by calling for medical aid.

Process Overview
Raw data in the form of audio samples is read into the system for training. The samples are normalized so that they can be compared accurately, and features are then extracted and stored. The training stage of the support vector machine (SVM) algorithm builds a model that defines how new input data is classified. Raw test data undergoes the same normalization and feature extraction and is then classified frame by frame by the SVM. Finally, our graphical user interface returns a consensus decision on the overall similarity of the test data to the training data.
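The stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: the synthetic sine-wave "recordings", the 8 kHz sample rate, the 20 ms frame size, the two features (zero-crossing rate and mean energy), the peak normalization, the linear SVM trained by subgradient descent on the hinge loss, and every function name are hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
import random

RATE = 8000   # samples per second (assumed)
FRAME = 160   # 20 ms frames at 8 kHz (assumed)

def tone(freq, amp, seconds=1.0):
    """Synthetic stand-in for a voice recording: a plain sine wave."""
    n = int(RATE * seconds)
    return [amp * math.sin(2 * math.pi * freq * i / RATE + 0.3) for i in range(n)]

def normalize(samples):
    """Peak-normalize so recordings made at different volumes are comparable."""
    peak = max(abs(s) for s in samples)
    return [s / peak for s in samples] if peak else list(samples)

def frame_features(samples):
    """Normalize, split into frames, and return [zero-crossing rate, mean energy] per frame."""
    x = normalize(samples)
    feats = []
    for start in range(0, len(x) - FRAME + 1, FRAME):
        f = x[start:start + FRAME]
        zcr = sum((a < 0) != (b < 0) for a, b in zip(f, f[1:])) / (FRAME - 1)
        energy = sum(s * s for s in f) / FRAME
        feats.append([zcr, energy])
    return feats

def score(model, x):
    """Signed distance-like score of one feature vector under a linear model."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_svm(feats, labels, epochs=200, lr=0.1, lam=0.001, seed=0):
    """Fit a linear SVM by stochastic subgradient descent on the hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(feats[0]), 0.0
    data = list(zip(feats, labels))
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            violated = y * score((w, b), x) < 1      # inside the margin or misclassified
            w = [wi * (1 - lr * lam) for wi in w]    # L2 shrinkage on the weights
            if violated:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def consensus(model, samples):
    """Classify every frame, then return the fraction voting for the target speaker."""
    votes = [score(model, x) > 0 for x in frame_features(samples)]
    return sum(votes) / len(votes)

# Training: label frames from the target "speaker" +1 and everyone else -1.
target = frame_features(tone(100, amp=0.5))
other = frame_features(tone(1000, amp=0.5))
model = train_svm(target + other, [1] * len(target) + [-1] * len(other))

# Testing: a quieter, slightly different recording goes through the same pipeline.
is_target = consensus(model, tone(110, amp=0.2)) > 0.5
```

Voting over frames rather than trusting a single frame is what makes the final decision a consensus: a few misclassified frames do not flip the overall verdict. A real system would presumably replace the toy features with richer voice features and could use a non-linear kernel.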

Speaker Recognition, Phrase Verification, and Distress Detection