Fast numpy operation on part of dataframe

I have a pandas dataframe with several columns. Two of them hold the date and time; the others are numerical.

I need to perform a fast in-place calculation on the numerical part of the dataframe. Currently I drop the first two columns, convert the numerical columns to a NumPy array, and use that array further down the code.

However, I want to keep the processed numerical values in the dataframe without touching the date and time columns.

Now:

# tanh norm
def tanh_ret():
    data = df.to_numpy()
    mu = np.mean(data)
    std = np.std(data)
    return 0.5 * (np.tanh(0.01 * ((data - mu) / std)) + 1)

del df['Date']
del df['Time']
nums = tanh_ret()
del df

What I want: normalize 3 df columns out of 5 in-place

Note that the dataset is large, so I would prefer as little data copying as possible while staying reasonably fast.

CodePudding user response：

Create a random pandas dataframe

I use five columns of random values here; substitute your own data. The Time and Date columns are set to a constant value.

import datetime as dt
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random((100,5)))
now = dt.datetime.now()

df['Time'] = now.strftime('%H:%M:%S')
df['Date'] = now.strftime('%m/%d/%Y')

In-place numerical processing

def tanh_ret(data):
    mu = data.mean()
    std = data.std()
    return 0.5 * (np.tanh(0.01 * ((data - mu) / std)) + 1)

num_cols = df.columns[df.dtypes != 'object']
df[num_cols] = df[num_cols].transform(tanh_ret)
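
The `df.dtypes != 'object'` test only works when every non-numerical column is stored as `object`. A parsed `datetime64` column would pass that test, and newer pandas versions may store strings under a dedicated string dtype rather than `object`. A possibly more robust sketch selects columns with `select_dtypes(include="number")` instead. The `demo` frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

def tanh_ret(data):
    mu = data.mean()
    std = data.std()  # pandas default: sample std (ddof=1)
    return 0.5 * (np.tanh(0.01 * ((data - mu) / std)) + 1)

# Hypothetical frame: two numerical columns, a parsed date, a time string
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, 20.0, 30.0, 40.0],
    "Date": pd.to_datetime(["2024-01-01"] * 4),  # datetime64, not object
    "Time": ["12:00:00"] * 4,
})

# Keep only columns with a numerical dtype, whatever the other dtypes are
num_cols = demo.select_dtypes(include="number").columns
demo[num_cols] = demo[num_cols].transform(tanh_ret)
```

Note that `transform` normalizes each column with its own mean and standard deviation, so `x` and `y` end up identical here even though their scales differ.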

Alternatively:

tan_map = {col: tanh_ret for col in num_cols}
df[num_cols] = df.transform(tan_map)
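
One difference from the code in the question: `transform` works per column and pandas `std()` uses the sample standard deviation (`ddof=1`), whereas the question computes a single `np.mean` and `np.std` (`ddof=0`) over the whole numerical block. If the global version is what you need, the following is a minimal sketch that pulls the numerical block out once and then reuses that one buffer for every step. The function name `tanh_norm_global` and the `demo` frame are made up for illustration, and how many copies pandas makes internally can vary by version.

```python
import numpy as np
import pandas as pd

def tanh_norm_global(df, cols):
    """Tanh-normalize `cols` of `df` with one global mean/std, as in the question."""
    # copy=True guarantees a writable array that is independent of df
    values = df[cols].to_numpy(dtype=np.float64, copy=True)
    mu = values.mean()
    std = values.std()  # NumPy default: population std (ddof=0)
    # Each step below writes into the same buffer instead of allocating a new one
    values -= mu
    values /= std
    values *= 0.01
    np.tanh(values, out=values)
    values += 1.0
    values *= 0.5
    df[cols] = values  # write back; the other columns are left untouched
    return df

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((6, 3)), columns=["a", "b", "c"])
demo["Date"] = "01/01/2024"
demo["Time"] = "12:00:00"
raw = demo[["a", "b", "c"]].to_numpy(copy=True)  # kept only for comparison

tanh_norm_global(demo, ["a", "b", "c"])
```

The in-place operators and the `out=` argument of `np.tanh` avoid the temporary arrays that the one-line formula would create, which may matter on a large dataset.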
