I have two features, rank and ratings, for different product IDs under different categories, scraped from an e-commerce website on different dates.
A sample dataframe is available here:
import pandas as pd
import numpy as np
import warnings; warnings.simplefilter('ignore')
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
df=pd.read_csv('https://raw.githubusercontent.com/amanaroratc/hello-world/master/testdf.csv')
df.head()
     category               bid        date   rank  ratings
0  Aftershave  ASCDBNYZ4JMSH42B  2021-10-01   61.0    462.0
1  Aftershave  ASCDBNYZ4JMSH42B  2021-10-02   69.0    462.0
2  Aftershave  ASCDBNYZ4JMSH42B  2021-10-05   89.0    463.0
3  Aftershave  ASCE3DZK2TD7G4DN  2021-10-01  309.0      3.0
4  Aftershave  ASCE3DZK2TD7G4DN  2021-10-02  319.0      3.0
I want to normalize rank and ratings using MinMaxScaler() from sklearn.
I tried:
cols=['rank','ratings']
features=df[cols]
scaler1=MinMaxScaler()
df_norm = df.copy()  # df_norm must exist before new columns can be assigned to it
df_norm[['rank_norm_mm', 'ratings_norm_mm']] = scaler1.fit_transform(features)
This normalizes over the entire dataset.
I want to do this within each category for each particular date, using groupby.
CodePudding user response:
Use GroupBy.apply:
file = 'https://raw.githubusercontent.com/amanaroratc/hello-world/master/testdf.csv'
df=pd.read_csv(file)
from sklearn.preprocessing import MinMaxScaler
cols=['rank','ratings']
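def f(x):
    # Fit a fresh scaler per (category, date) group so min/max are computed per group
    x = x.copy()
    scaler1 = MinMaxScaler()
    x[['rank_norm_mm', 'ratings_norm_mm']] = scaler1.fit_transform(x[cols])
    return x
df = df.groupby(['category', 'date'], group_keys=False).apply(f)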
Another solution:
file = 'https://raw.githubusercontent.com/amanaroratc/hello-world/master/testdf.csv'
df=pd.read_csv(file)
from sklearn.preprocessing import MinMaxScaler
scaler1=MinMaxScaler()
cols=['rank','ratings']
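# group_keys=False keeps the original row index so the join aligns row by row
df = df.join(df.groupby(['category', 'date'], group_keys=False)[cols]
    .apply(lambda x: pd.DataFrame(scaler1.fit_transform(x), index=x.index, columns=cols))
    .add_suffix('_norm_mm'))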
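For comparison, the same per-group min-max scaling can be sketched in plain pandas with GroupBy.transform, without sklearn. This is only an illustration under stated assumptions: the toy rows below are made up (they are not the CSV from the question), and min_max is a hypothetical helper name. Constant groups are mapped to 0.0, which matches what MinMaxScaler returns for a constant feature with the default feature range.

```python
import pandas as pd

# Toy data standing in for the CSV from the question (values are made up).
df = pd.DataFrame({
    "category": ["A", "A", "A", "A", "B"],
    "date": ["2021-10-01", "2021-10-01", "2021-10-01", "2021-10-02", "2021-10-01"],
    "rank": [10.0, 20.0, 30.0, 5.0, 7.0],
    "ratings": [1.0, 3.0, 2.0, 4.0, 9.0],
})
cols = ["rank", "ratings"]
keys = ["category", "date"]


def min_max(s):
    """Scale one group's values of one column to [0, 1] (hypothetical helper)."""
    span = s.max() - s.min()
    if span == 0:
        # Constant group (for example a single row): avoid dividing by zero.
        return pd.Series(0.0, index=s.index)
    return (s - s.min()) / span


# transform returns a result aligned with the original index,
# so it can be assigned straight back as a new column.
for col in cols:
    df[f"{col}_norm_mm"] = df.groupby(keys)[col].transform(min_max)

print(df)
```

Because transform keeps the original index, no join or reset_index is needed, and the other columns (such as bid in the question's data) stay untouched.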
CodePudding user response:
Use GroupBy.apply:
>>> df.groupby(['category', 'date'])[['rank', 'ratings']] \
.apply(lambda x: pd.DataFrame(scaler1.fit_transform(x), columns=x.columns)) \
.droplevel(2).reset_index()
     category        date  rank  ratings
0  Aftershave  2021-10-01   0.0      1.0
1  Aftershave  2021-10-01   1.0      0.0
2  Aftershave  2021-10-02   0.0      1.0
3  Aftershave  2021-10-02   1.0      0.0
4  Aftershave  2021-10-05   0.0      0.0
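Note that the output above no longer carries bid or the original row index. If those need to survive, one option is to scale each group explicitly and join the result back on the original index. The following is a rough sketch, assuming made-up toy rows (not the CSV from the question) and illustrative variable names such as parts and new_cols; it loops over the groups rather than relying on apply, so it does not depend on how a given pandas version indexes the result of apply.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data with interleaved categories (values are made up for illustration).
df = pd.DataFrame({
    "category": ["A", "B", "A", "B", "A"],
    "bid": ["a1", "b1", "a2", "b2", "a3"],
    "date": ["2021-10-01"] * 5,
    "rank": [61.0, 4.0, 309.0, 8.0, 185.0],
    "ratings": [462.0, 10.0, 3.0, 20.0, 3.0],
})
cols = ["rank", "ratings"]
new_cols = [f"{c}_norm_mm" for c in cols]

parts = []
for _, group in df.groupby(["category", "date"]):
    # Fit a fresh scaler on this group only, then keep the group's row labels.
    scaled = MinMaxScaler().fit_transform(group[cols])
    parts.append(pd.DataFrame(scaled, index=group.index, columns=new_cols))

# Joining on the index puts every scaled row next to its original row.
result = df.join(pd.concat(parts))
print(result)
```

The join realigns rows by index, so the scaled columns line up with bid even though the groups were processed out of the original row order.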