Given Google text-to-speech audio generated from text, get list of words at time stamps? Unity C#

My goal is to lip-sync an avatar which has blend shapes for phonemes like "a" and "m". I have the text (which is generated dynamically via GPT-3 and thus not knowable in advance) and feed it to the Google text-to-speech API, which gives me a raw audio file in return. Can I somehow analyze this file or do something else to know what word is spoken at what time when I play the audio clip? This would help me parse the words into phonemes and adjust the mouth accordingly. Thanks!

CodePudding user response：

You can use a Unity asset called SALSA LipSync Suite.

CodePudding user response：

You can use AudioClip.GetData, which fills an array with the sample data of a given audio clip. Note that with compressed audio files, the sample data can only be retrieved when the Load Type is set to Decompress on Load in the audio importer, so it is simplest not to use compressed audio.

You can use this in combination with AudioSource.timeSamples which returns what sample the audio source is currently on.

Every update, you can change how far the lips separate based on the volume at the current time sample.

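using UnityEngine;

// The class name is illustrative; attach this to the object with the AudioSource.
public class LipVolume : MonoBehaviour
{
    public float volMultiplier = 1f;

    float[] clipData;
    AudioSource aSrc;
    AudioClip ac;

    void Start()
    {
        aSrc = GetComponent<AudioSource>();
        ac = aSrc.clip;
        // GetData returns interleaved samples, so size the buffer by channel count
        clipData = new float[ac.samples * ac.channels];
        ac.GetData(clipData, 0);
    }

    void Update()
    {
        // isPlaying and timeSamples belong to the AudioSource, not the AudioClip
        if (aSrc.isPlaying)
        {
            // timeSamples counts per-channel samples; read the first channel
            int index = Mathf.Min(aSrc.timeSamples * ac.channels, clipData.Length - 1);
            float curVol = clipData[index];
            curVol *= volMultiplier;
            // set position of lips given this volume
            // the raw sample value is -1 to 1 by default (when volMultiplier
            // is 1)
        }
    }
}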
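
A single sample swings between negative and positive many times per second, so driving the lips from one value can make the mouth jitter. One possible refinement, sketched below, is to average the level over a short window of samples around the current playback position. The class name LipSyncMath, the method WindowRms, and the window size are all hypothetical and not part of Unity; the helper only assumes the interleaved float array that AudioClip.GetData fills.

```csharp
using System;

// Hypothetical helper, not part of Unity: smooths the level over a short window.
public static class LipSyncMath
{
    // Returns the root-mean-square level of up to `window` frames starting at
    // `startFrame`, reading only the first channel of interleaved sample data.
    public static float WindowRms(float[] samples, int channels, int startFrame, int window)
    {
        if (samples == null || samples.Length == 0 || channels <= 0 || window <= 0)
            return 0f;

        int frames = samples.Length / channels;
        int first = Math.Max(0, Math.Min(startFrame, frames));
        int last = Math.Min(frames, first + window);
        if (last <= first)
            return 0f;

        double sum = 0.0;
        for (int f = first; f < last; f++)
        {
            float s = samples[f * channels]; // first channel of frame f
            sum += s * s;
        }
        return (float)Math.Sqrt(sum / (last - first));
    }
}
```

In Update you could then call something like LipSyncMath.WindowRms(clipData, ac.channels, aSrc.timeSamples, 1024) and multiply the result by volMultiplier. The result is never negative, which maps more naturally onto a blend shape weight than a raw sample does. The window of 1024 frames is only a starting guess to tune by eye.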