Background

Extracting the spoken word from audio files using speech recognition technology provides a sound base for working with recorded speech, but transcripts do not capture all the information the audio contains. Other technologies are needed to extract the meta-data that can make transcripts more readable and provide context and information beyond a simple word sequence. Allocating words to specific speakers and identifying sentence boundaries are examples of such meta-data. Both help provide a richer transcription of the audio, making transcripts more readable and potentially helping with other tasks such as summarization, parsing and machine translation.

Audio Source Separation, also known as the Cocktail Party Problem, is one of the hottest topics in audio because of its practical use in so many situations: identifying the vocals in a song, helping deaf people hear a speaker in a noisy area, or isolating a caller's voice in a phone call made while riding a bike against the wind.

Another hot topic is speaker diarization. Also known as speech segmentation and clustering, this is a method of deciding who spoke when. Speech and non-speech sounds in an audio file are first separated, and speaker changes are then marked within the detected speech. Non-speech is a general class consisting of music, silence, noise and so forth, which need not be broken down by type. Speaker diarization allows us to search audio files by speaker, makes transcripts easier to read, and provides information that can be used for speaker adaptation in speech recognition systems.
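
To make this concrete, here is a minimal sketch of what a diarization run can look like using the open-source pyannote.audio library. This is an illustration only, not the tooling used in this experiment; the audio file name and the pretrained pipeline identifier are assumptions.

# Illustrative sketch using the open-source pyannote.audio library;
# the file name and pipeline identifier are placeholders.
from pyannote.audio import Pipeline

# Load a pretrained speaker diarization pipeline.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

# Run diarization on a recording.
diarization = pipeline("meeting.wav")

# Print one line per speech turn: start time, end time and speaker label.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")

The output is a list of time-stamped speaker turns, which is exactly the meta-data needed to mark speaker changes in a transcript.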

Finally, audio transcription is the process of converting audio recordings into written text. It has many applications: creating subtitles for a video, sharing a written transcript of an interview, or producing text for a low-resource spoken language. Done manually, however, transcription is a time-consuming and tedious task. Even transcribing a single interview for a blog post can take hours, and a project with multiple audio files can take days or even weeks.

AI is increasingly being used to speed up the transcription process by automatically transcribing the audio. This can be a game changer for businesses that have a lot of audio content that needs to be transcribed.
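
As a small illustration of automatic transcription, the sketch below uses the open-source Whisper library; it is one of several possible tools rather than the system used in this experiment, and the model size and file name are placeholders.

# Illustrative sketch using the open-source whisper library;
# model size and file name are placeholders.
import whisper

# Load a general-purpose speech recognition model.
model = whisper.load_model("base")

# Transcribe a recording and print the plain-text transcript.
result = model.transcribe("interview.mp3")
print(result["text"])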

In this “multi-channel/speaker source separation” scenario, we train a model on a dataset containing both the mixed audio and the individual voices so that it can separate the speech of multiple unseen speakers.

As with the state-of-the-art methods currently available, we train a single model for the number of speakers we want to separate. The performance gap between the resulting model and published methods widens as the number of speakers increases.

Our approach employs a mask-free method built from a sequence of bi-directional RNNs applied to the audio. As this experiment shows, it is beneficial to evaluate the error after each RNN is applied, giving a compound loss that reflects the reconstruction quality after every layer. Each RNN block is built with a specific type of residual connection in which two RNNs run in parallel: the output of each layer is the concatenation of the element-wise multiplication of the two RNN outputs with the layer input, which passes through a bypass connection.
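
To make the block structure concrete, below is a minimal PyTorch-style sketch of such a layer and of the per-layer outputs that feed the compound loss. Module names, dimensions and the projection layer are illustrative assumptions rather than the exact implementation used in the experiment.

# Illustrative sketch only: names and dimensions are assumptions, not the
# exact implementation used in this experiment.
import torch
import torch.nn as nn

class MulCatBlock(nn.Module):
    """Two bi-directional LSTMs run in parallel; their outputs are multiplied
    element-wise and concatenated with the block input (the bypass connection)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.rnn_a = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.rnn_b = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        # Project back to the input width so blocks can be stacked.
        self.proj = nn.Linear(2 * hidden + dim, dim)

    def forward(self, x):                     # x: (batch, time, dim)
        a, _ = self.rnn_a(x)                  # (batch, time, 2 * hidden)
        b, _ = self.rnn_b(x)                  # (batch, time, 2 * hidden)
        gated = a * b                         # element-wise multiplication
        out = torch.cat([gated, x], dim=-1)   # concatenate with the bypass input
        return self.proj(out)

class Separator(nn.Module):
    """A stack of blocks with an output head after each one, so a reconstruction
    loss can be computed per layer and summed into a compound loss."""
    def __init__(self, dim, hidden, n_blocks, n_speakers):
        super().__init__()
        self.blocks = nn.ModuleList([MulCatBlock(dim, hidden) for _ in range(n_blocks)])
        self.heads = nn.ModuleList([nn.Linear(dim, dim * n_speakers) for _ in range(n_blocks)])

    def forward(self, x):
        outputs = []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            outputs.append(head(x))           # per-layer speaker estimates
        return outputs

During training, a reconstruction loss would be computed against the clean sources for every element of the returned list and summed, so that each intermediate layer is pushed towards a good separation, matching the compound loss described above.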

Business Use Cases and Applications

There are multiple direct and indirect applications of this experiment. Some of them include:

  • Call center data analysis – Speech analytics can be used to analyse voice recordings or live customer calls to find useful information and improve the quality of service. Such procedures, which combine both speaker diarization and voice separation, identify words and phrases spoken by different people in the call, which can be used for quality assurance and training purposes.
  • Medical records – Many patients today find themselves having online or phone consultations rather than talking to a practitioner face to face. Here, it is important to separate the speech of the doctor and the patient, and in some cases this separation can help improve patient treatment. In the wake of the recent pandemic, a lot of research has explored how speech separation can help a doctor assess an ailment more comprehensively in remote settings. The transcripts that result from audio separation can also augment medical records, allowing practitioners to track a patient’s progress over time and prepare for each appointment.
  • Audio transcription – Converting audio files into text with automatic speaker diarization. This approach enriches the understanding from automatic speech recognition, which is important for downstream applications such as analytics for call and meeting transcriptions.
  • Research interview response analysis – When an interview is conducted with different stakeholders, there are bound to be differing opinions and analyses, so speaker diarization and audio separation play a vital role in understanding what each interviewee is explaining and, at the same time, which questions each interviewer asked. This helps improve market research, letting us explore the unstructured data domain and strengthen analysis based on the many research interviews available online. Separating individual speakers also makes it possible to ascertain exactly which questions were asked and what answers subject matter experts in that domain gave.

Additionally, there are possible indirect use cases for audio transcription which are explained below:

  • It allows information to be catalogued in a text-based format, making it easier to analyse results. The data in text format could later be scaled for sophisticated tasks like information extraction, emotion analysis, topic identification, etc.
  • Transcripts help to increase search engine optimization (SEO) rankings and can be used to repurpose audio into additional marketing materials. Similarly, subtitles improve video accessibility, average watch times and overall views. 
  • Audio transcriptions allow organisations to share important data with stakeholders through reports based on a text copy of a recording.




