I have a dataset (df) like that :
Card Number Amount
0 102******38 22.0
1 102******56 76.0
2 102******38 25.6
and it's load using
import panda as pd
df = pd.read_csv("file.csv")
And I would like to calculate something like :
df["Zscore"] = df["Amount"] - AVERAGE(all X in df["Amount"] who have the same "Number" in df["Card Number"] )
My intuition is something like :
import numpy as np
import statistics as st
df["Zscore"] = df["Amount"] - st.mean(np.where(condition, df["Amount"], 0))
But I can't figure out how to express my condition
After some research, I found a solution using Verticapy
import verticapy.stats as st
df["Zscore"] = (df["Amount"] - st.mean(df["Amount"])._over(["CardNumber"]))
But I need to convert my code using Verticapy, and I would like another way to do that because I have never used Verticapy and don't really want to at the moment.
So do I need to use "np.where()" and in this case is it possible to formulate my condition ? Or do I need to alter my way to attack the problem ?
CodePudding user response:
First, you need to calculate the mean value per card number. Let's calculate that by grouping same card numbers, getting the average amount, and call that 'card_mean':
mean_values = df.groupby('Card Number')\
.mean()['Amount']\
.reset_index()\
.rename(columns={'Amount':'card_mean'})
Then, you want to merge that mean value back into the original dataframe, as a new column, for each 'Card Number' that you have in your original df
df = pd.merge(df, mean_values, how='left', on='Card Number')
This gives you a combined df
with 2 columns: the 'Amount' (which you loaded), and the 'card_mean' per card number (which we just calculated by averaging in step 1)
Now you can go and do you magic with both, i.e., subtract each, average over that difference, etc.. For example:
df['z_score'] = df['Amount'] - df['card_mean']