Matching audio to a text

I have an audio file and a text that corresponds to the speech in this audio file.

The audio files I'm collecting come from volunteers reading a text provided to them. I want to write an algorithm that matches the audio they recorded against the text, to make sure that they actually read it.

I have not decided on the programming language yet, but I'm curious whether this could be implemented on the web.

CodePudding user response：

Use a pre-trained automatic speech recognition (ASR) model to get a text transcript of the speech, e.g. with Python and Hugging Face: Facebook's Wav2vec 2.0 model (https://huggingface.co/facebook/wav2vec2-base-960h) or any other ASR model (https://huggingface.co/models?pipeline_tag=automatic-speech-recognition). These models are usually language-dependent, so you will have to find one that matches the language your volunteers are reading in.
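As a rough sketch of this step, assuming the `transformers` package (plus ffmpeg for decoding audio files) is installed: the helper name `transcribe` and its optional `asr` argument are my own for illustration, and the model is just the one linked above.

```python
def transcribe(audio_path, asr=None):
    """Return the transcript of one audio file as a plain string.

    `asr` is any callable that maps a file path to {"text": ...}.
    If none is given, a Hugging Face ASR pipeline is created.
    """
    if asr is None:
        # Imported lazily so this module loads even without transformers.
        from transformers import pipeline

        asr = pipeline(
            "automatic-speech-recognition",
            model="facebook/wav2vec2-base-960h",  # English-only model
        )
    result = asr(audio_path)
    return result["text"]
```

If you process many recordings, build the pipeline once and pass it in as `asr`, so the model is not reloaded for every file.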

Process the text you already have into a form closer to an audio transcript (lowercase it, remove punctuation, etc.).
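A minimal version of that normalization could look like this. The function name `normalize` and the exact rules (keep apostrophes, drop everything else that is not a letter or digit) are assumptions; adapt them to what your ASR model actually outputs.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip punctuation and collapse whitespace."""
    # Unify Unicode variants (e.g. full-width characters) before lowercasing.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace everything except word characters, whitespace and apostrophes.
    text = re.sub(r"[^\w\s']", " ", text)
    # \w also matches underscores, so remove those separately.
    text = text.replace("_", " ")
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())
```

Apply the same function to both the script and the transcript so they end up in the same form. Note that this does not spell out digits: if the script contains "5" and the model writes "five", you would need an extra step for that.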

Then it's up to you how to compare the two texts, and this also depends on how long they can be. If it's just single sentences, you could simply check whether they are identical. If they're a bit longer, you can start with word-wise matching to see what percentage of the words the transcript gets wrong (Word Error Rate, WER). WER is computed from a word-level Levenshtein (edit) distance, so it already accounts for extra or left-out words, which are very difficult to handle in a self-made metric. Since many trained ASR models use context to determine the transcript, other text similarity metrics such as character-level Levenshtein distance or the n-gram-based BLEU might also be worth trying.
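To make the Word Error Rate idea concrete, here is a small pure-Python sketch: the word-level edit distance between script and transcript, divided by the number of words in the script. The function name `word_error_rate` is my own, and it assumes both strings have already been normalized and can be split on whitespace.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # word missing from the transcript
                    curr[j - 1] + 1,     # extra word in the transcript
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

A value of 0.0 means the transcript matches the script exactly. The value can exceed 1.0 when the transcript contains many extra words.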

Generally you can use the same approaches which are used to evaluate Automatic Speech Recognition models, since you do the same thing (compare a transcript to the expected text). There are repositories and packages for this, e.g. this one and this one.

In any case, be aware that a model's speech recognition will never be perfect, so a less-than-perfect score doesn't mean your volunteer didn't follow the script. But if you compare the scores across volunteers, you can get an idea of how closely each one sticks to the script, as well as how clearly they speak in general.
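One possible way to do that comparison, purely as an illustration: flag recordings whose error rate is far above what is typical for the group, rather than using a fixed cut-off. The function name `flag_suspicious` and the parameters `k` and `min_margin` are hypothetical and would need tuning on your own data.

```python
from statistics import median


def flag_suspicious(wer_by_volunteer, k=3.0, min_margin=0.05):
    """Return the names whose error rate is unusually high for the group."""
    if not wer_by_volunteer:
        return []
    scores = list(wer_by_volunteer.values())
    med = median(scores)
    # Median absolute deviation: a spread estimate robust to outliers.
    mad = median([abs(score - med) for score in scores])
    # min_margin avoids flagging everyone when all scores are nearly equal.
    threshold = med + max(k * mad, min_margin)
    return sorted(
        name for name, score in wer_by_volunteer.items() if score > threshold
    )
```

Anything flagged this way is only a candidate for a manual listen, since a high score can also come from the recording conditions rather than from the volunteer skipping the text.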

You should also keep in mind that accents, background noise, audio quality, and how closely your volunteers' recording conditions resemble those of the model's training data will all influence the scores.