How to extract spectrograms from sounds of different durations?

I am working on a project that classifies gender by voice.

My dataset contains 4528 .wav files of male and female voices.

I want to feed spectrograms to a neural network.

I extracted the spectrograms with the librosa library (using librosa.core.stft) and saved them to a .npz file.

My problem is that my audio files have different lengths (some are shorter than one second and some are longer).

I want to use recurrent neural networks.

These are the shapes of the first five samples. Unfortunately, they all differ:

(32, 1025)
(26, 1025)
(40, 1025)
(31, 1025)
(45, 1025)

When I check the shape of my whole dataset:

X = np.array(data["specs"])
print(X.shape)

it returns (4528,), which is only the number of samples. Also, when I fit the network with input_shape=(32,1025), I get this error:

Failed to convert a NumPy array to a Tensor (Unsupported object type list).

What can I do about this problem?

This is how I extract the spectrograms and store them in the .npz file:

import os
import time

import librosa
import numpy as np
from colorama import Fore  # assumption: Fore comes from the colorama package

def save_spec(npz_path, dataset_path, sample_rate=22050, hop_length=512, n_fft=2048):

    # dictionary for storing data
    data = {"mapping": [],
            "specs": [],
            "labels": []}
    # loop through all the labels
    for i, (dirpath, dirnames, filenames) in enumerate(os.walk(dataset_path)):

        # ensure that we're not at the root level (compare values, not identity)
        if dirpath != dataset_path:
            # save the semantic label
            semantic_label = os.path.basename(dirpath)  # train/female => "female"
            data["mapping"].append(semantic_label)

            # process files for a specific gender
            for file in filenames:
                file_path = os.path.join(dirpath, file)
                try:
                    print(Fore.CYAN + "Loading File...: {} :".format(file))
                    signal, _ = librosa.load(file_path, sr=sample_rate)
                except Exception:
                    print(Fore.RED + "Loading FAILED...")
                    continue  # skip this file instead of reusing the previous signal
                try:
                    print(Fore.BLUE + "\t Extracting Spectrogram...")
                    spectrogram = librosa.core.stft(signal, n_fft=n_fft, hop_length=hop_length)
                    spectrogram = np.abs(spectrogram)  # magnitude spectrogram
                    spectrogram = spectrogram.T        # shape: (frames, 1 + n_fft // 2)
                except Exception:
                    print(Fore.RED + "\t Extracting FAILED...")
                    continue  # skip this file instead of storing a stale spectrogram
                print(Fore.YELLOW + "\t\t Storing Data...")
                data["specs"].append(spectrogram.tolist())
                data["labels"].append(i - 1)
                print(Fore.GREEN + "\t\t\t Preprocessing Complete!")
                print(Fore.WHITE + "\t\t\tFile: {} : \n".format(file))
                time.sleep(0.1)
    # the spectrograms have different lengths, so they are stored as an object array
    np.savez_compressed(npz_path,
                        x_train=np.array(data["specs"], dtype=object),
                        y_train=data["labels"],
                        mapping=data["mapping"])

And this is my network design:

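import numpy as np
# assumption: the Keras classes come from tensorflow.keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

DATA_PATH = "/content/drive/MyDrive/AI/Data/Per-Gender_Rec/data.npz"
DATA = np.load(DATA_PATH , allow_pickle=True)
X = DATA["x_train"]            # object array of lists, shape (4528,)
Y = np.array(DATA["y_train"])  # all labels, not only the first one

for i in range(5):
  print(np.array(X[i]).shape)  # (32, 1025), (26, 1025), ...

Network = Sequential()
Network.add(Flatten(input_shape=(32,1025)))
Network.add(Dense(512 , activation="relu"))
Network.add(Dense(256 , activation="relu"))
Network.add(Dense(64 , activation="relu"))
Network.add(Dense(1 , activation="sigmoid"))

Network.compile(optimizer="adam",
                loss="binary_crossentropy",
                metrics=["accuracy"])
Network.summary()

# fails: X holds lists of different lengths, not one rectangular array
Network.fit(X , Y , batch_size=32 , epochs=5)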

How can I fix that?

CodePudding user response:

Pad your shorter sequences so that they're all the same length.

Suppose you have the following array and want to pad it at the bottom:

arr = np.array([[1,2,3,],[2,4,5]])       # shape (2,3)
arr = np.vstack([arr, np.zeros([1,3])])  # add a row of 0's at the bottom

So pick the length of your longest spectrogram, pad all the others to that length, and experiment to see whether padding with zeros at the beginning, at the end, or symmetrically works better.
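As a rough sketch of that idea applied to a whole list of spectrograms: the helper name pad_spectrograms and its where argument are made up for illustration, and random arrays stand in for the real spectrograms, using the first five shapes from the question. It assumes each spectrogram has shape (frames, 1025).

```python
import numpy as np

def pad_spectrograms(specs, where="end"):
    """Zero-pad (frames, bins) arrays along the time axis to the longest one.

    Hypothetical helper: `where` selects padding at the "end", at the
    "start", or split on both sides ("center").
    """
    max_frames = max(s.shape[0] for s in specs)
    padded = []
    for s in specs:
        missing = max_frames - s.shape[0]
        if where == "end":
            before, after = 0, missing
        elif where == "start":
            before, after = missing, 0
        elif where == "center":
            before = missing // 2
            after = missing - before
        else:
            raise ValueError("where must be 'end', 'start' or 'center'")
        # pad only the time axis (axis 0); leave the frequency bins untouched
        padded.append(np.pad(s, ((before, after), (0, 0)), mode="constant"))
    # every array now has the same shape, so they stack into one 3-D array
    return np.stack(padded)

# stand-in data with the frame counts from the question
rng = np.random.default_rng(0)
specs = [rng.random((n, 1025), dtype=np.float32) for n in (32, 26, 40, 31, 45)]

X = pad_spectrograms(specs, where="end")
print(X.shape)  # (5, 45, 1025)
```

With these shapes the padded array is (5, 45, 1025), so the network's input_shape would be (45, 1025) instead of (32, 1025). If you move to a recurrent network later, it may be worth trying Keras's Masking layer so the all-zero frames are skipped, but whether that helps here is something to test.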