Column pair subtraction - Pandas-CodePudding

I would like to calculate, for each customer_id, the difference within each column pair in a more dynamic way. The columns are always in the same order: customer_id is always column_1, and the differences are always column_3 - column_2, column_5 - column_4, and so on.

Sample df

customer_id   count_sessions_q4_2021  count_sessions_q1_2022  purchases_q4_2021  purchases_q1_2022 
203           100                     110                     12                 11
484           210                     215                     21                 18
582           409                     400                     35                 32

Expected output_df

customer_id   count_sessions_diff   purchases_diff
203           10                    -1
484           5                     -3 
582          -9                     -3

This is what I have tried so far:

df.replace('null', np.nan, inplace=True)
df_2 = df.set_index('customer_id').apply(pd.to_numeric, errors='coerce')

a = np.arange(len(df_2.columns)) // 2

s = (df_2.columns
           .to_series()
           .str.extract('^(.*?)\d', expand=False)
           .groupby(a)
           .agg('_'.join)
           .add('_diff'))

df_2 = df_2.groupby(a, axis=1).diff(1).dropna(how='all', axis=1)

df_2.columns = s

but I am getting the following error: TypeError: sequence item 0: expected str instance, float found

CodePudding user response:

MultiIndex comes in handy here, as it allows for a relatively easy way to reshape and manipulate the dataframe:

Set customer_id as the index, and reshape the remaining columns into a MultiIndex:

temp = df.set_index('customer_id')
temp.columns = temp.columns.str.split(r'_(q\d+)_', expand = True)

temp
            count_sessions      purchases
                        q4   q1        q4   q1
                      2021 2022      2021 2022
customer_id
203                    100  110        12   11
484                    210  215        21   18
582                    409  400        35   32

Iterate through the first level of the columns in a list comprehension and compute the differences. Within each group the 2021 column comes before the 2022 column, so np.subtract.reduce yields 2021 - 2022; multiplying the outcome by -1 flips it to 2022 - 2021:

keys = temp.columns.get_level_values(0).unique()
outcome = [temp[key].agg(np.subtract.reduce, axis = 1).mul(-1)
           for key in keys]
outcome = pd.concat(outcome, axis = 1, keys = keys)
outcome.add_suffix('_diff').reset_index()

   customer_id  count_sessions_diff  purchases_diff
0          203                   10              -1
1          484                    5              -3
2          582                   -9              -3
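
The steps above can also be put together as a self-contained sketch. It rebuilds the sample df from the question and, as a variation that is not part of the answer above, subtracts the first column of each group from the second by position, which avoids the multiplication by -1. It assumes each measure has exactly two columns, earlier period first.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "customer_id": [203, 484, 582],
    "count_sessions_q4_2021": [100, 210, 409],
    "count_sessions_q1_2022": [110, 215, 400],
    "purchases_q4_2021": [12, 21, 35],
    "purchases_q1_2022": [11, 18, 32],
})

temp = df.set_index("customer_id")
# The capture group keeps "q4"/"q1" as its own level: (measure, quarter, year)
temp.columns = temp.columns.str.split(r"_(q\d+)_", expand=True)

keys = temp.columns.get_level_values(0).unique()
# Assumption: within each measure, the earlier period is column 0 and the later is column 1
outcome = pd.concat(
    [temp[key].iloc[:, 1] - temp[key].iloc[:, 0] for key in keys],
    axis=1,
    keys=keys,
)
result = outcome.add_suffix("_diff").reset_index()
print(result)
```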

CodePudding user response:

Here is an easy way to do it, using pandas' subtract. I first set customer_id as the index so it isn't lost. Then, from every second column, the column before it is subtracted.

df.set_index("customer_id").iloc[:, 1::2].subtract(np.array(df.iloc[:, 1::2]), axis=1)

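As you may notice, the .iloc[:, 1::2] slice is the same in both places, yet it selects different columns. After set_index, customer_id is no longer a column, so the slice picks the q1_2022 columns. On the original df, customer_id is still column 0, so the same slice picks the q4_2021 columns.

Wrapping the second operand in np.array means that pandas doesn't try to align on column names and subtracts by position instead.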
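
As a minimal sketch of the same positional idea, the version below names the two slices explicitly and renames the result columns to match the expected output. The regular expression used for renaming is an assumption about the column naming (a `_q<digits>_<year>` tail), not something the answer above specifies.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "customer_id": [203, 484, 582],
    "count_sessions_q4_2021": [100, 210, 409],
    "count_sessions_q1_2022": [110, 215, 400],
    "purchases_q4_2021": [12, 21, 35],
    "purchases_q1_2022": [11, 18, 32],
})

values = df.set_index("customer_id")
earlier = values.iloc[:, 0::2]  # first column of each pair (q4_2021)
later = values.iloc[:, 1::2]    # second column of each pair (q1_2022)

# .to_numpy() strips the labels, so the subtraction is purely positional
diff = later - earlier.to_numpy()

# Assumed naming scheme: replace the "_q<digits>_<year>" tail with "_diff"
diff.columns = later.columns.str.replace(r"_q\d+_\d+$", "_diff", regex=True)
output_df = diff.reset_index()
print(output_df)
```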

CodePudding user response:

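Here is another way, using pd.wide_to_long. The long frame stacks all q4_2021 rows above all q1_2022 rows, so a diff with periods equal to the number of customers subtracts each customer's earlier quarter from the later one: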

(pd.wide_to_long(df,i = 'customer_id',j = 'quarter', stubnames = ['count_sessions','purchases'], suffix = '.*')
 .diff(periods = df.shape[0])
 .dropna()
 .rename('{}_diff'.format,axis=1))

Output:

                      count_sessions_diff  purchases_diff
customer_id quarter                                      
203         _q1_2022                 10.0            -1.0
484         _q1_2022                  5.0            -3.0
582         _q1_2022                 -9.0            -3.0
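
If the integer dtype and the flat customer_id column from the expected output are wanted, one possible variation is sketched below. It is a swapped-in technique rather than the code above: it selects the two quarters with xs and subtracts them instead of using diff, so no NaN rows appear and the values stay integers. The sep and suffix arguments are assumptions about the column naming, and the sketch assumes exactly two quarters per measure, earlier one first.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "customer_id": [203, 484, 582],
    "count_sessions_q4_2021": [100, 210, 409],
    "count_sessions_q1_2022": [110, 215, 400],
    "purchases_q4_2021": [12, 21, 35],
    "purchases_q1_2022": [11, 18, 32],
})

long = pd.wide_to_long(
    df,
    i="customer_id",
    j="quarter",
    stubnames=["count_sessions", "purchases"],
    sep="_",
    suffix=r"q\d+_\d+",  # assumed suffix shape, e.g. "q4_2021"
)

# Quarter labels in order of first appearance; assumes exactly two of them
earlier, later = long.index.get_level_values("quarter").unique()

# Each xs result is indexed by customer_id, so the subtraction aligns per customer
diff = long.xs(later, level="quarter") - long.xs(earlier, level="quarter")
tidy = diff.add_suffix("_diff").reset_index()
print(tidy)
```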