Normalizing a huge python dataframe


I have a huge CSV file (~2 GB) that I have imported using Dask. Now I want to normalize this dataframe, which contains about 70k columns. I have written this Python function to do it:

from tqdm import tqdm

def normalize(df):
    result = df.copy()
    for col in tqdm(df.columns):
        if col != 'name':  # basically, don't normalize the column named "name"
            max_value = df[col].max()
            min_value = df[col].min()
            result[col] = (df[col] - min_value) / (max_value - min_value)
    return result

It works okay but takes a lot of time. When I ran it, the progress bar estimated approximately 88 hours to complete. I tried switching to sklearn's MinMaxScaler, but it doesn't show any progress of the normalization and I am afraid it will also take quite a long time. Is there any other way to normalize all the columns (while ignoring a few, like I did with that if condition)?
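For reference, the MinMaxScaler route would look roughly like this (a sketch assuming a plain pandas dataframe; for a Dask dataframe, dask_ml.preprocessing provides a similar MinMaxScaler):

from sklearn.preprocessing import MinMaxScaler

num_cols = [col for col in df.columns if col != 'name']
scaler = MinMaxScaler()
# fit_transform returns a numpy array; assign it back to the selected columns
df[num_cols] = scaler.fit_transform(df[num_cols])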

CodePudding user response:

You don't need to loop at all. If all the columns other than name hold numerical values, you can do something along the following lines:

num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())

Here is a minimal code sample:

import pandas as pd

df = pd.DataFrame({"name": ["a"]*4, "a": [2,3,4,6], "b": [9,5,2,34]})
num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())

print(df)
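Since the question loads the file with Dask, the same vectorized idea also carries over to a dask.dataframe — a minimal sketch, assuming the CSV was read with dask.dataframe.read_csv ("data.csv" is a hypothetical path):

import dask.dataframe as dd

ddf = dd.read_csv("data.csv")  # hypothetical path; read lazily in partitions
num_cols = [col for col in ddf.columns if col != "name"]

# column-wise min/max become small in-memory pandas Series after compute()
mins = ddf[num_cols].min().compute()
maxs = ddf[num_cols].max().compute()

# apply the same vectorized formula partition by partition
normalized = ddf[num_cols].map_partitions(lambda part: (part - mins) / (maxs - mins))
result = dd.concat([ddf[["name"]], normalized], axis=1)
print(result.head())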

CodePudding user response:

"I am afraid that it will also take quite a lot of time"

Then, considering that you need only numerical operations, I suggest using numpy for the actual number crunching and pandas only for extracting the columns to process. Simple example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A','B','C'], 'x1': [1,2,3], 'x2': [4,8,6], 'x3': [10,15,30]})

# pull the numeric columns into a single array
num_arr = df[['x1','x2','x3']].to_numpy()
mins = np.min(num_arr, axis=0)  # column-wise minima
maxs = np.max(num_arr, axis=0)  # column-wise maxima
result_arr = (num_arr - mins) / (maxs - mins)

result_df = pd.concat([df[['name']], pd.DataFrame(result_arr, columns=['x1','x2','x3'])], axis=1)
print(result_df)

Output:

  name   x1   x2    x3
0    A  0.0  0.0  0.00
1    B  0.5  1.0  0.25
2    C  1.0  0.5  1.00

Disclaimer: this solution assumes that df has a default index like 0, 1, 2, ...; otherwise, pass index=df.index when building the result dataframe so the concat aligns correctly.

If you need a further speed increase, consider parallelization, which is applicable here because the values in each column are computed independently of the other columns; see the sketch below.
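A minimal sketch of that idea using a thread pool (numpy releases the GIL during these array operations, so threads can overlap; the block count here is arbitrary and should be tuned to your core count):

import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

df = pd.DataFrame({'name': ['A','B','C'], 'x1': [1,2,3], 'x2': [4,8,6], 'x3': [10,15,30]})
num_cols = [col for col in df.columns if col != 'name']

def normalize_block(block):
    # min-max scale one block of columns, same formula as above
    mins = block.min(axis=0)
    maxs = block.max(axis=0)
    return (block - mins) / (maxs - mins)

num_arr = df[num_cols].to_numpy(dtype=float)
# split column-wise; with 70k columns you would use many more blocks (e.g. one per core)
blocks = np.array_split(num_arr, 3, axis=1)
with ThreadPoolExecutor() as pool:
    result_arr = np.hstack(list(pool.map(normalize_block, blocks)))

result_df = pd.concat([df[['name']], pd.DataFrame(result_arr, columns=num_cols)], axis=1)
print(result_df)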
