For a given column pair I'd like to calculate the difference between them two.
df
customer_id total_spent_21_q4_ total_spent_22_q1_ time_spent_online_21_q4_ time_spent_online_22_q1_
132 394 439 29 32
222 482 502 24 26
492 923 678 24 24
This is what I have tried and it seems to work but I am not able to add back the original column names as well as '_diff' as a suffix
d = {'customer_id': [132, 222,492], 'total_spent_21_q4_': [394, 482,923],'total_spent_22_q1_':[439,502,678],
'time_spent_online_21_q4_':[29,24,24],'time_spent_online_22_q1_':[32,26,24]}
df = pd.DataFrame(data=d)
df.replace('null', np.nan, inplace=True)
df_2 = df.set_index('customer_id')
df_2.columns = df_2.columns.str.extract('(\d )', expand=False)
a = np.arange(len(df_2.columns)) // 2
df_3 = df_2.iloc[:, 1::2].sub(df_2.iloc[:, ::2].to_numpy())
diff_table = df_3
diff_table = diff_table.round(decimals = 2)
diff_table.reset_index(inplace=True)
diff_table
I get the following output:
customer_id 22 22
132 45 3
222 20 2
492 -245 0
desired output df
customer_id total_spent_diff time_spent_online_diff
132 45 3
222 20 2
492 -245 0
EDIT: Version 2 using multi-index
Using the below code I don't get the desired output df and the subtraction values are inaccurate.
df.replace('null', np.nan, inplace=True)
df_2 = df.set_index('customer_id')
df_2.columns = df_2.columns.str.split(r'_(q\d )_', expand = True)
keys = df_2.columns.get_level_values(0).unique()
df_2 = [df_2[key].agg(np.subtract.reduce, axis = 1).mul(-1)
for key in keys]
df_2 = pd.concat(df_2, axis = 1, keys = keys)
df_2.add_suffix('_diff').reset_index()
diff_table = df_2
diff_table = diff_table.round(decimals = 2)
diff_table.reset_index(inplace=True)
CodePudding user response:
One option is to groupby on the columns, after splitting on either total_spent
or time_spent_online
:
temp = df.set_index('customer_id')
split = temp.columns.str.split('(total_spent|time_spent_online)').str[1]
grouped = temp.groupby(split, axis = 1)
grouped.agg(np.subtract.reduce, axis = 1).mul(-1).reset_index()
customer_id time_spent_online total_spent
0 132 3 45
1 222 2 20
2 492 0 -245
Another option is to split on the numbers and characters after the numbers:
temp = df.set_index('customer_id')
temp.columns = temp.columns.str.split('(_\d . )', expand = True)
(temp
.groupby(level = 0, axis = 1)
.agg(np.subtract.reduce, axis = 1)
.mul(-1)
.reset_index()
)
customer_id time_spent_online total_spent
0 132 3 45
1 222 2 20
2 492 0 -245
Another option is to convert to long form, using pivot_longer from pyjanitor, and then aggregate; I prefer the previous options as converting to long form is an operation you can do without, (and avoid extra performance costs).
# pip install pyjanitor
import janitor
import pandas as pd
(df
.pivot_longer(
index = 'customer_id',
names_to = ('total_spent', 'time_spent_online'),
names_pattern = ('total_spent', 'time_spent_online'))
.groupby('customer_id')
.agg(np.subtract.reduce)
.mul(-1)
.reset_index()
)
customer_id total_spent time_spent_online
0 132 45 3
1 222 20 2
2 492 -245 0
CodePudding user response:
You can remove the following renaming column header line
df_2.columns = df_2.columns.str.extract('(\d )', expand=False)
Then rename the diff_table
columns before reset_index
based on the integer in your original column header
import re
diff_table.rename(columns=lambda col: re.split(r'(\d )', col)[0] 'diff', inplace=True)
diff_table.reset_index(inplace=True)