Accelerating speed of reading contents from dataframe in pandas

Time:05-12

Suppose we have a table with the following dimensions:

print(metadata.shape)  # (8732, 8)


Suppose we want to read slice_file_name from each row, load the corresponding sound file from disk, and extract MFCC (mel-frequency cepstral coefficient) features:

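import librosa
import numpy as np

def feature_extractor(file_name):
    # Load the audio, resampling with the faster 'kaiser_fast' filter
    audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    # Average each of the 40 coefficients over all time frames
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features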

If I use the following loop:

import os
from tqdm import tqdm

extracted_features = []
for index_num, row in tqdm(metadata.iterrows()):
    file_name = os.path.join(os.path.abspath(Base_Directory), str(row["slice_file_name"]))
    final_class_labels = row["class"]
    data = feature_extractor(file_name)
    extracted_features.append([data, final_class_labels])

it takes the following amount of time in total:

3555it [21:15,  2.79it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1323
  n_fft, y.shape[-1]
8326it [48:40,  3.47it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1103
  n_fft, y.shape[-1]
8329it [48:41,  3.89it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1523
  n_fft, y.shape[-1]
8732it [50:53,  2.86it/s]

How can I optimize this code so that it runs in less time? Is that possible?

Answer:

You could try running the feature extractor in parallel. This gives you a new column in your dataframe containing the mfccs_scaled_features.

import os
import librosa
import numpy as np
from pandarallel import pandarallel

pandarallel.initialize()

PATH = os.path.abspath(Base_Directory)

def feature_extractor(file_name):
    # On Windows, you may need to put these imports inside the function:
    # import librosa 
    # import numpy as np
    # import os
    
    file_name = os.path.join(PATH, file_name)
    audio,sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features

# The question's dataframe is named metadata
metadata['mfccs_scaled_features'] = metadata['slice_file_name'].parallel_apply(feature_extractor)
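
If you would rather not add a dependency, the same idea can be sketched with the standard library's concurrent.futures. This is only a minimal sketch, not a drop-in replacement: the helper name extract_all and its use_threads flag are made up for illustration. It assumes the extractor is a top-level function (process pools need to pickle it) and that each file can be processed independently. Whether threads or processes are faster for librosa is something to measure on your own data.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def extract_all(file_names, extractor, max_workers=None, use_threads=False):
    """Apply `extractor` to every file name in parallel.

    Hypothetical helper: results come back in the same order as
    `file_names`, because Executor.map preserves input order.
    """
    # Processes sidestep the GIL but must pickle `extractor` and its results;
    # threads avoid pickling but only help when the work releases the GIL.
    executor_cls = ThreadPoolExecutor if use_threads else ProcessPoolExecutor
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(extractor, file_names))


# Usage with the names from the question (not executed here):
# features = extract_all(metadata["slice_file_name"], feature_extractor)
# extracted_features = list(zip(features, metadata["class"]))
```

Because the results keep the input order, they can be paired with the class column directly, as in the commented usage lines. On Windows or macOS, the call that starts a process pool should sit under an if __name__ == "__main__": guard.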