Home > Net >  Pandas - Subtraction in column pairs
Pandas - Subtraction in column pairs

Time:04-17

For a given column pair I'd like to calculate the difference between them two.

df

customer_id total_spent_21_q4_  total_spent_22_q1_  time_spent_online_21_q4_ time_spent_online_22_q1_
132         394                 439                 29                       32  
222         482                 502                 24                       26
492         923                 678                 24                       24

This is what I have tried and it seems to work but I am not able to add back the original column names as well as '_diff' as a suffix

d = {'customer_id': [132, 222,492], 'total_spent_21_q4_': [394, 482,923],'total_spent_22_q1_':[439,502,678],
     'time_spent_online_21_q4_':[29,24,24],'time_spent_online_22_q1_':[32,26,24]}
df = pd.DataFrame(data=d)

df.replace('null', np.nan, inplace=True)
df_2 =  df.set_index('customer_id')
df_2.columns = df_2.columns.str.extract('(\d )', expand=False)
a = np.arange(len(df_2.columns)) // 2
df_3 = df_2.iloc[:, 1::2].sub(df_2.iloc[:, ::2].to_numpy())
diff_table = df_3
diff_table = diff_table.round(decimals = 2)
diff_table.reset_index(inplace=True)
diff_table

I get the following output:

customer_id   22                22
132           45                3
222           20                2
492          -245               0

desired output df

customer_id   total_spent_diff  time_spent_online_diff
132           45                3
222           20                2
492          -245               0

EDIT: Version 2 using multi-index

Using the below code I don't get the desired output df and the subtraction values are inaccurate.

df.replace('null', np.nan, inplace=True)
df_2 = df.set_index('customer_id')
df_2.columns = df_2.columns.str.split(r'_(q\d )_', expand = True)
keys = df_2.columns.get_level_values(0).unique()
df_2 = [df_2[key].agg(np.subtract.reduce, axis = 1).mul(-1)
       for key in keys]
df_2 = pd.concat(df_2, axis = 1, keys = keys)
df_2.add_suffix('_diff').reset_index()
diff_table = df_2
diff_table = diff_table.round(decimals = 2)
diff_table.reset_index(inplace=True)

CodePudding user response:

One option is to groupby on the columns, after splitting on either total_spent or time_spent_online:

temp = df.set_index('customer_id')
split = temp.columns.str.split('(total_spent|time_spent_online)').str[1]
grouped = temp.groupby(split, axis = 1)
grouped.agg(np.subtract.reduce, axis = 1).mul(-1).reset_index()

   customer_id  time_spent_online  total_spent
0          132                  3           45
1          222                  2           20
2          492                  0         -245

Another option is to split on the numbers and characters after the numbers:

temp = df.set_index('customer_id')
temp.columns = temp.columns.str.split('(_\d . )', expand = True)
(temp
.groupby(level = 0, axis = 1)
.agg(np.subtract.reduce, axis = 1)
.mul(-1)
.reset_index()
)

   customer_id  time_spent_online  total_spent
0          132                  3           45
1          222                  2           20
2          492                  0         -245

Another option is to convert to long form, using pivot_longer from pyjanitor, and then aggregate; I prefer the previous options as converting to long form is an operation you can do without, (and avoid extra performance costs).

# pip install pyjanitor
import janitor
import pandas as pd

(df
.pivot_longer(
    index = 'customer_id', 
    names_to = ('total_spent', 'time_spent_online'), 
    names_pattern = ('total_spent', 'time_spent_online'))
.groupby('customer_id')
.agg(np.subtract.reduce)
.mul(-1)
.reset_index()
)

   customer_id  total_spent  time_spent_online
0          132           45                  3
1          222           20                  2
2          492         -245                  0

CodePudding user response:

You can remove the following renaming column header line

df_2.columns = df_2.columns.str.extract('(\d )', expand=False)

Then rename the diff_table columns before reset_index based on the integer in your original column header

import re

diff_table.rename(columns=lambda col: re.split(r'(\d )', col)[0]   'diff', inplace=True)

diff_table.reset_index(inplace=True)
  • Related