I would like to calculate, for each customer_id, the difference within each column pair in a more dynamic way. The column pairs are always in the same order, i.e. customer_id will always be column_1, and the differences are always column_3 - column_2, column_5 - column_4, and so on...
Sample df
customer_id count_sessions_q4_2021 count_sessions_q1_2022 purchases_q4_2021 purchases_q1_2022
203 100 110 12 11
484 210 215 21 18
582 409 400 35 32
Expected output_df
customer_id count_sessions_diff purchases_diff
203 10 -1
484 5 -3
582 -9 -3
This is what I have tried so far:
df.replace('null', np.nan, inplace=True)
df_2 = df.set_index('customer_id').apply(pd.to_numeric, errors='coerce')
a = np.arange(len(df_2.columns)) // 2
s = (df_2.columns
.to_series()
.str.extract('^(.*?)\d', expand=False)
.groupby(a)
.agg('_'.join)
.add('_diff'))
df_2 = df_2.groupby(a, axis=1).diff(1).dropna(how='all', axis=1)
df_2.columns = s
but I am getting the following error:
TypeError: sequence item 0: expected str instance, float found
CodePudding user response:
MultiIndex comes in handy here, as it allows for a relatively easy way to reshape and manipulate the dataframe:
Set the customer id as index, and reshape the remaining columns into a multi index
temp = df.set_index('customer_id')
temp.columns = temp.columns.str.split(r'_(q\d)_', expand = True)
temp
            count_sessions      purchases
                        q4   q1        q4   q1
                      2021 2022      2021 2022
customer_id
203                    100  110        12   11
484                    210  215        21   18
582                    409  400        35   32
Iterate through the first level of the columns in a list comprehension and compute the differences; in this case we can tell that 2022 trails 2021 for each section, so we'll multiply our outcome by -1 to flip it:
keys = temp.columns.get_level_values(0).unique()
outcome = [temp[key].agg(np.subtract.reduce, axis = 1).mul(-1)
           for key in keys]
outcome = pd.concat(outcome, axis = 1, keys = keys)
outcome.add_suffix('_diff').reset_index()
customer_id count_sessions_diff purchases_diff
0 203 10 -1
1 484 5 -3
2 582 -9 -3
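For reference, the steps above can be put together into one self-contained sketch. It rebuilds the sample frame from the question and, as a variation that is not part of the answer above, subtracts the first column of each group from the second by position instead of using np.subtract.reduce with a sign flip. It assumes every metric has exactly two columns, ordered earlier quarter first.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer_id": [203, 484, 582],
    "count_sessions_q4_2021": [100, 210, 409],
    "count_sessions_q1_2022": [110, 215, 400],
    "purchases_q4_2021": [12, 21, 35],
    "purchases_q1_2022": [11, 18, 32],
})

temp = df.set_index("customer_id")
# Split "metric_qN_YYYY" into three levels: metric, quarter, year.
temp.columns = temp.columns.str.split(r"_(q\d)_", expand=True)

metrics = temp.columns.get_level_values(0).unique()

# Assumption: each metric has exactly two columns, earlier quarter first,
# so "second minus first" is the later quarter minus the earlier one.
outcome = pd.concat(
    [temp[m].iloc[:, 1] - temp[m].iloc[:, 0] for m in metrics],
    axis=1,
    keys=metrics,
)
outcome = outcome.add_suffix("_diff").reset_index()
print(outcome)
```

Subtracting by position makes the direction of the difference explicit, at the cost of hard-coding the two-columns-per-metric assumption.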
CodePudding user response:
Here is an easy way to do it, using pandas subtract. I first set customer_id as the index, so it isn't lost. Then, from every second column, the column before it is subtracted.
df.set_index("customer_id").iloc[:, 1::2].subtract(np.array(df.iloc[:, 1::2]), axis=1)
As you may notice, the .iloc[] is the same for both. This is because I am setting the index for the first, which "drops" the first column (but this isn't applied to the second). Using np.array inside the subtract means that it doesn't try to match column names.
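The one-liner keeps the original q1_2022 column names. A minimal sketch of the same positional idea that also produces the _diff names from the expected output could look like this. The variable names (values, earlier, later) are only illustrative, and the rename assumes every value column ends in a _q<digit>_<year> suffix as in the sample.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer_id": [203, 484, 582],
    "count_sessions_q4_2021": [100, 210, 409],
    "count_sessions_q1_2022": [110, 215, 400],
    "purchases_q4_2021": [12, 21, 35],
    "purchases_q1_2022": [11, 18, 32],
})

values = df.set_index("customer_id")
earlier = values.iloc[:, 0::2]  # 1st, 3rd, ... value columns
later = values.iloc[:, 1::2]    # 2nd, 4th, ... value columns

# to_numpy() drops the labels, so the subtraction is purely positional
# instead of being aligned on (non-matching) column names.
diff = later - earlier.to_numpy()

# Assumed naming pattern: replace the trailing "_q<digit>_<year>" with "_diff".
diff.columns = diff.columns.str.replace(r"_q\d_\d{4}$", "_diff", regex=True)

output_df = diff.reset_index()
print(output_df)
```

Slicing both halves from the same indexed frame avoids relying on the offset that customer_id introduces when one side is indexed and the other is not.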
CodePudding user response:
Here is another way:
(pd.wide_to_long(df,i = 'customer_id',j = 'quarter', stubnames = ['count_sessions','purchases'], suffix = '.*')
.diff(periods = df.shape[0])
.dropna()
.rename('{}_diff'.format,axis=1))
Output:
                      count_sessions_diff  purchases_diff
customer_id quarter
203         _q1_2022                 10.0            -1.0
484         _q1_2022                  5.0            -3.0
582         _q1_2022                 -9.0            -3.0