I have a pandas DataFrame with several columns. Two of them hold the date and time; the others are numerical.
I need to perform a fast in-place calculation on the numerical part of the DataFrame. Currently I drop the first two columns, convert the numerical columns to a NumPy array, and use that array further down the code.
However, I want to keep the processed values in the DataFrame without touching the date and time columns.
What I have now:
# tanh norm
def tanh_ret():
    data = df.to_numpy()
    mu = np.mean(data)
    std = np.std(data)
    return 0.5 * (np.tanh(0.01 * ((data - mu) / std)) + 1)

del df['Date']
del df['Time']
nums = tanh_ret()
del df
What I want: normalize three of the five DataFrame columns in place.
Note that the dataset is large, so I would prefer as little data copying as possible while still being reasonably fast.
CodePudding user response:
Create a random pandas dataframe
I use five columns of random values here; substitute your own data. The Time and Date columns are set to a constant value.
import datetime as dt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random((100,5)))
now = dt.datetime.now()
df['Time'] = now.strftime('%H:%M:%S')
df['Date'] = now.strftime('%m/%d/%Y')
In-place numerical processing
def tanh_ret(data):
    mu = data.mean()
    std = data.std()
    return 0.5 * (np.tanh(0.01 * ((data - mu) / std)) + 1)

# Columns whose dtype is not 'object'; this assumes the Date/Time strings are stored as object
num_cols = df.columns[df.dtypes != 'object']
df[num_cols] = df[num_cols].transform(tanh_ret)
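A note on the snippet above: DataFrame.transform calls tanh_ret once per column, so each column is normalised with its own mean and standard deviation, and pandas' std() defaults to the sample estimate (ddof=1). The code in the question instead takes a single np.mean and np.std (ddof=0) over the whole array. If that global behaviour is what you need, a minimal sketch could look like the following. The names tanh_norm_global and demo and the column labels a, b, c are hypothetical, and select_dtypes(include="number") is swapped in for the dtype comparison as one way to pick the numeric columns.

```python
import numpy as np
import pandas as pd

def tanh_norm_global(values):
    """Tanh-normalise an array using one mean/std over all entries."""
    mu = values.mean()
    std = values.std()  # NumPy default: population std (ddof=0)
    return 0.5 * (np.tanh(0.01 * ((values - mu) / std)) + 1)

# Hypothetical demo frame: three numeric columns plus string Date/Time columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((100, 3)), columns=["a", "b", "c"])
demo["Date"] = "01/01/2024"
demo["Time"] = "12:00:00"

# Pick the numeric columns by dtype family rather than by "not object".
num_cols = demo.select_dtypes(include="number").columns

# Normalise with global statistics and write the result back;
# Date and Time are never touched.
demo[num_cols] = tanh_norm_global(demo[num_cols].to_numpy())
```

Which variant is appropriate depends on whether the columns share a scale; the per-column version treats each feature independently, while this one preserves relative differences between columns.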
Alternatively:
tan_map = {col: tanh_ret for col in num_cols}
df[num_cols] = df.transform(tan_map)
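Since the question stresses a large dataset and few copies, here is a hedged sketch of one way to cut down on temporary arrays: pull the numeric block out once, then overwrite that buffer step by step with NumPy's in-place operators and the out= argument of np.tanh. The helper name tanh_norm_inplace, the demo frame and its column labels are hypothetical. It uses the global mean and standard deviation as in the question, and pandas may still copy when the result is assigned back, so treat it as a way to reduce intermediates rather than a guarantee of zero copies.

```python
import numpy as np
import pandas as pd

def tanh_norm_inplace(arr):
    """Overwrite a float array with its tanh normalisation, reusing its buffer."""
    mu = arr.mean()
    std = arr.std()
    arr -= mu              # centre
    arr /= std             # scale
    arr *= 0.01            # flatten the tanh response
    np.tanh(arr, out=arr)  # write the result into the same buffer
    arr += 1.0
    arr *= 0.5             # map from (-1, 1) to (0, 1)
    return arr

# Hypothetical demo frame: three numeric columns plus string Date/Time columns.
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])
demo["Date"] = "01/01/2024"
demo["Time"] = "12:00:00"

num_cols = ["a", "b", "c"]
# One explicit copy out of the frame, so the buffer is writable and owned by us.
block = demo[num_cols].to_numpy(dtype="float64", copy=True)
demo[num_cols] = tanh_norm_inplace(block)
```

Compared with the single-expression version, this avoids allocating a new array for every arithmetic step, which matters mostly when the numeric block is large relative to available memory.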