I'm currently studying the singing language identification problem (and the basics of machine learning). I found many papers on this topic, but some of them don't provide any code (or even pseudocode), so I'm trying to reproduce them from their model descriptions.
A good example is "Listen, Read, and Identify: Multimodal Singing Language Identification of Music" by Keunwoo Choi and Yuxuan Wang.
To sum up, they concatenate two branches: an audio branch (which takes a spectrogram) and a text branch (which takes a language probability vector computed from the metadata with langdetect, a 56-dimensional vector).
The text branch is a 3-layer MLP where each layer consists of a 128-unit fully-connected layer, a batch normalization layer, and a ReLU activation [22].
For the text model I got something like this:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, BatchNormalization, Dense

text_model = Sequential()
text_model.add(Input((56,), name='input'))
text_model.add(BatchNormalization())
text_model.add(Dense(128, activation='relu'))
langdetect.detect_langs(metadata) returns [de:0.8571399874707945, en:0.14285867860989504].
I'm not sure I've described my model correctly, and I can't figure out how to feed the langdetect probability vector into the Keras model properly.
CodePudding user response:
First, you need to transform the langdetect output into a vector of constant length. There are 55 languages in the library, so we need to create a vector of length 55 where the i-th element represents the probability of the text coming from the i-th language. You could do it like this:
import tensorflow as tf
import numpy as np
import langdetect

# Initialize langdetect's global factory so we can read its list of supported languages.
langdetect.detector_factory.init_factory()
LANGUAGES_LIST = langdetect.detector_factory._factory.langlist

def get_probabilities_vector(text):
    # detect_langs only returns the languages it considers likely,
    # so we scatter their probabilities into a fixed-length vector.
    predictions = langdetect.detect_langs(text)
    output = np.zeros(len(LANGUAGES_LIST))
    for p in predictions:
        output[LANGUAGES_LIST.index(p.lang)] = p.prob
    return tf.constant(output)
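As a quick sanity check, you could call the helper on a short sentence (the sample text below is arbitrary, and langdetect's output is not fully deterministic unless you fix its seed, e.g. langdetect.DetectorFactory.seed = 0):

vec = get_probabilities_vector('Das ist ein kurzer Test')
print(vec.shape)                       # (55,)
print(float(tf.reduce_sum(vec)))       # roughly 1.0, spread over the detected languages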
Then you need to create a model with multiple inputs. This can be done using the functional API, e.g. like this (change the inputs according to your use case):
def create_model():
    # Placeholder audio input; replace the shape with your real spectrogram/feature shape.
    audio_input = tf.keras.Input(shape=(256,))
    langdetect_input = tf.keras.Input(shape=(55,))

    # Concatenate both branches and classify into 55 languages.
    x = tf.keras.layers.concatenate([audio_input, langdetect_input])
    x = tf.keras.layers.Dense(128, activation='relu')(x)
    output = tf.keras.layers.Dense(55)(x)

    model = tf.keras.Model(
        inputs={
            'audio': audio_input,
            'text': langdetect_input
        },
        outputs=output)

    return model
Testing the model on some input:
model = create_model()

# Random placeholder audio features; a langdetect vector for the text input.
audio_input = tf.constant(np.random.rand(256))
langdetect_input = get_probabilities_vector('This is just a test input')

model({
    'audio': tf.expand_dims(audio_input, 0),
    'text': tf.expand_dims(langdetect_input, 0)
})
>>> <tf.Tensor: shape=(1, 55), dtype=float32, numpy=
array([[ 0.23361185, 0.19011918, -0.45230836, -0.0602392 , -0.20067683,
0.9698535 , -1.0724173 , 0.08978442, 0.052798 , -0.16554174,
0.9238764 , 1.0331644 , 0.4508734 , -0.2450786 , -1.0605856 ,
0.3239496 , -1.0073977 , -0.2129285 , -0.6817296 , 0.05288622,
0.9089616 , -0.11521344, 0.25696573, -0.07688305, -0.36123943,
-0.0317415 , -0.18303779, 0.13786468, 0.88620317, 0.11393422,
-0.5215691 , -0.28585738, 0.54988045, -0.02300271, -0.4347821 ,
-0.57744324, 0.14031887, 0.8255624 , -0.13157232, -1.1060234 ,
-0.24097277, 0.12950295, 0.4586677 , 0.37702668, 0.7558856 ,
-0.05933011, 0.53903174, 0.27433476, -0.18464057, 1.0673125 ,
-0.05723387, -0.03429477, 0.4431308 , -0.14510366, -0.28087378]],
dtype=float32)>
I am expanding the dimensions of the inputs using the expand_dims function so that the inputs have shapes (1, 256) and (1, 55), which matches the (batch_size, 256) and (batch_size, 55) inputs that the model expects during training.
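If you want to actually train the model rather than just call it, you could compile it and pass the inputs as a dict with the same keys. The optimizer, loss, and data below are only placeholders for illustration; the Dense(55) output produces raw logits, hence from_logits=True:

# Illustrative training setup with random placeholder data (not real features or labels).
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

num_samples = 8
dummy_audio = np.random.rand(num_samples, 256).astype('float32')
dummy_text = np.random.rand(num_samples, 55).astype('float32')
dummy_labels = np.random.randint(0, 55, size=(num_samples,))

model.fit(
    {'audio': dummy_audio, 'text': dummy_text},
    dummy_labels,
    epochs=1,
    batch_size=4)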
This is just a draft, but it shows roughly how your problem could be solved.
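If you want to stay closer to the branch description you quoted from the paper (a 3-layer MLP where each layer is a 128-unit fully-connected layer followed by batch normalization and ReLU), you could build that stack for the text input before concatenation. The sketch below is only my reading of that description; the audio input is still just a placeholder vector, not the paper's actual spectrogram branch:

def create_paper_style_model():
    audio_input = tf.keras.Input(shape=(256,), name='audio')      # placeholder audio features
    langdetect_input = tf.keras.Input(shape=(55,), name='text')   # langdetect probability vector

    # Text branch: 3 x (Dense(128) -> BatchNorm -> ReLU), as described in the question.
    x = langdetect_input
    for _ in range(3):
        x = tf.keras.layers.Dense(128)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)

    # Concatenate with the (here untouched) audio input and classify into 55 languages.
    merged = tf.keras.layers.concatenate([audio_input, x])
    output = tf.keras.layers.Dense(55)(merged)

    return tf.keras.Model(
        inputs={'audio': audio_input, 'text': langdetect_input},
        outputs=output)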