Speeding up reading contents from a dataframe in pandas

Suppose we have a table with the following dimensions:

print(metadata.shape)  # (8732, 8)

Suppose we want to read slice_file_name for each row (and then load the corresponding sound file from the drive) and extract its MFCC (mel-frequency cepstral coefficient) features:

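import librosa
import numpy as np

def feature_extractor(file_name):
    # Load the clip; 'kaiser_fast' trades some resampling quality for speed
    audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    # Average over the time frames to get one 40-value vector per file
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features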
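
To make the averaging step concrete, here is a minimal sketch of what np.mean(mfccs_features.T, axis=0) does to the array shape. The random array is only a stand-in for real librosa output, and the frame count of 173 is an arbitrary assumption; a real clip gives a different number of frames depending on its length.

```python
import numpy as np

# Stand-in for the output of librosa.feature.mfcc with n_mfcc=40:
# one row per coefficient, one column per time frame.
# The 173 frames are an arbitrary choice for illustration.
rng = np.random.default_rng(0)
mfccs_features = rng.normal(size=(40, 173))

# Transposing gives (frames, coefficients); averaging over axis 0
# collapses the frames and leaves one value per coefficient.
mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
print(mfccs_scaled_features.shape)  # (40,)
```

So every file ends up as a fixed-length vector of 40 numbers, regardless of how long the clip is.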

If I use the following loop:

import os
from tqdm import tqdm

extracted_features = []
for index_num, row in tqdm(metadata.iterrows()):
    # Build the full path to the audio file for this row
    file_name = os.path.join(os.path.abspath(Base_Directory), str(row["slice_file_name"]))
    final_class_labels = row["class"]
    data = feature_extractor(file_name)
    extracted_features.append([data, final_class_labels])

it takes about 51 minutes in total:

3555it [21:15,  2.79it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1323
  n_fft, y.shape[-1]
8326it [48:40,  3.47it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1103
  n_fft, y.shape[-1]
8329it [48:41,  3.89it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1523
  n_fft, y.shape[-1]
8732it [50:53,  2.86it/s]

How can I optimize this code so that it runs in less time? Is that possible?

CodePudding user response：

You could try running the feature extractor in parallel. This adds a new column to your dataframe containing the mfccs_scaled_features for each file.

import os
import librosa
import numpy as np
from pandarallel import pandarallel

pandarallel.initialize()

PATH = os.path.abspath(Base_Directory)

def feature_extractor(file_name):
    # On Windows, you may need to move these imports inside the function:
    # import librosa 
    # import numpy as np
    # import os
    
    file_name = os.path.join(PATH, file_name)
    audio,sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features

# metadata is the dataframe from the question
metadata['mfccs_scaled_features'] = metadata['slice_file_name'].parallel_apply(feature_extractor)
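
If you would rather not add a dependency, the same idea can be sketched with the standard library's concurrent.futures. This is only a sketch under assumptions: extract_all is a hypothetical helper name, the chunksize of 16 is an arbitrary guess, and the demo line uses len as a stand-in for feature_extractor so that it runs without any audio files.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def extract_all(file_names, extractor, max_workers=None,
                executor_cls=ProcessPoolExecutor):
    """Apply `extractor` to every file name using a pool of workers.

    Results are returned in the same order as `file_names`.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map preserves input order. chunksize batches the items
        # sent to each worker process; thread pools ignore it.
        return list(pool.map(extractor, file_names, chunksize=16))


# Demo with a thread pool and a stand-in extractor (the built-in len).
print(extract_all(["a.wav", "bb.wav"], len, max_workers=2,
                  executor_cls=ThreadPoolExecutor))  # [5, 6]
```

With real data you would call something like extract_all(metadata['slice_file_name'], feature_extractor) using the default process pool, and on Windows that call should sit under an if __name__ == "__main__": guard. How much time this saves depends on the number of cores and on how fast the drive can serve the files, so it is worth timing on a small subset first.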