Computing yoy changes for multiple clients in same column in dataframe using pandas


I have a dataframe with four columns:

  • Client ID
  • Date
  • Assets
  • Flows

Not all clients have data for the full set of dates. In such cases, the rows are simply missing; put differently, I don't have the same number of rows for each client.

I would like to compute the following and add the results as additional columns:

  • Absolute and relative change in Assets over 12m and 3m
  • Sum of Flows over 12m and 3m

Where the statistics can't be computed (e.g. the first 11 months for the 12-month figures), the column should be filled with NaN.

I have tried with groupby, but I can't find a way around the fact that the length of the data is different for each client.

Here is an example of my data (first 4 columns) and the desired result (last 4 columns), done in Excel: [screenshot not reproduced]
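
For reference, here is a minimal sketch of the layout; the client IDs, dates and values are made up for illustration, since the screenshot is not shown:

import pandas as pd

# Hypothetical illustration of the input layout; the values are invented
df = pd.DataFrame({
    'Client ID': ['A', 'A', 'A', 'B', 'B'],
    'Date': pd.to_datetime(['2021-01-31', '2021-02-28', '2021-03-31',
                            '2021-01-31', '2021-03-31']),  # client B is missing February
    'Assets': [100.0, 105.0, 102.0, 200.0, 210.0],
    'Flows': [5.0, 3.0, -2.0, 10.0, 4.0],
})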

CodePudding user response:

If you are interested in monthly changes, you can add a column "months_since" which equals the number of months elapsed since a reference date. You can then use

df.pivot(index="months_since", columns="Client ID", values="Assets")

to get a matrix representation of your data. If a certain client is missing an observation, it will simply show up as NaN.

Then it is easy to compute the sums/deltas using df.rolling(12).sum() or df.diff(12).
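
A minimal sketch of that idea, assuming a dataframe df with the columns from the question; the months_since computation and the variable names are illustrative:

import pandas as pd

# Assumes df has columns 'Client ID', 'Date' (datetime), 'Assets', 'Flows'
start = df['Date'].min()
df['months_since'] = (df['Date'].dt.year - start.year) * 12 + (df['Date'].dt.month - start.month)

# One row per month, one column per client; missing observations show up as NaN
# (assumes every month appears for at least one client, so rows line up with calendar months)
assets = df.pivot(index='months_since', columns='Client ID', values='Assets')
flows = df.pivot(index='months_since', columns='Client ID', values='Flows')

abs_change_12m = assets.diff(12)                          # absolute change vs. 12 months earlier
rel_change_12m = assets.pct_change(12, fill_method=None)  # relative change, without forward-filling gaps
flows_sum_12m = flows.rolling(12, min_periods=12).sum()   # NaN until 12 months are available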

CodePudding user response:

import pandas as pd
df_data = pd.read_excel('sample2.xlsx')

# Relative change in Assets over 3 and 12 months, per client
df_result_change_assets_3m = (
    df_data.pivot_table(values = 'Assets', index = ['Client ID','Date'])
           .unstack(['Client ID']).pct_change(3)
           .unstack(['Date'])['Assets'].reset_index()
           .rename(columns = {0:'Change Assets 3m (%)'})
)

df_result_change_assets_12m = (
    df_data.pivot_table(values = 'Assets', index = ['Client ID','Date'])
           .unstack(['Client ID']).pct_change(12)
           .unstack(['Date'])['Assets'].reset_index()
           .rename(columns = {0:'Change Assets 12m (%)'})
)

# Absolute (USD) change in Assets over 3 months, per client
df_result_change_usd_assets_3m = df_data.pivot_table(values = 'Assets', index = ['Client ID','Date']).unstack(['Client ID'])
df_result_change_usd_assets_3m = df_result_change_usd_assets_3m - df_result_change_usd_assets_3m.shift(3)
df_result_change_usd_assets_3m = df_result_change_usd_assets_3m.unstack(['Date'])['Assets'].reset_index().rename(columns = {0:'Change Assets 3m (USD)'})

# Absolute (USD) change in Assets over 12 months, per client
df_result_change_usd_assets_12m = df_data.pivot_table(values = 'Assets', index = ['Client ID','Date']).unstack(['Client ID'])
df_result_change_usd_assets_12m = df_result_change_usd_assets_12m - df_result_change_usd_assets_12m.shift(12)
df_result_change_usd_assets_12m = df_result_change_usd_assets_12m.unstack(['Date'])['Assets'].reset_index().rename(columns = {0:'Change Assets 12m (USD)'})


# Merge the computed columns back onto the original data
df_data = df_data.merge(df_result_change_assets_3m, how = 'left', on = ['Client ID','Date'])
df_data = df_data.merge(df_result_change_assets_12m, how = 'left', on = ['Client ID','Date'])
df_data = df_data.merge(df_result_change_usd_assets_3m, how = 'left', on = ['Client ID','Date'])
df_data = df_data.merge(df_result_change_usd_assets_12m, how = 'left', on = ['Client ID','Date'])

df_data

CodePudding user response:

I suggest you split the problem in two: first calculate the changes on assets, then focus on flows. For the first task you can create two new columns and compute the changes with a self join. You can create the new columns using the pd.tseries.offsets.MonthEnd offset as follows:

data_df['date_12MonthsAfter'] = data_df['date'] + pd.tseries.offsets.MonthEnd(12)
data_df['date_3MonthsAfter'] = data_df['date'] + pd.tseries.offsets.MonthEnd(3)
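
Note that MonthEnd counts month-end anchors, so this step implicitly assumes the date column holds month-end dates. A quick illustration with made-up dates:

import pandas as pd

# On a month-end date, MonthEnd(12) moves exactly 12 month ends forward
pd.Timestamp('2021-01-31') + pd.tseries.offsets.MonthEnd(12)   # Timestamp('2022-01-31')

# A mid-month date first rolls forward to its own month end, and that roll counts as one step
pd.Timestamp('2021-01-15') + pd.tseries.offsets.MonthEnd(12)   # Timestamp('2021-12-31')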

Then you can merge the data with two self joins that fetch the quantities you need. There are a couple of ways to do that; I did it like this:

data_merged_12Months = (data_df.merge(data_df[['clientId', 'date_12MonthsAfter', 'assets']], 
                                  left_on = ['clientId', 'date'], 
                                  right_on = ['clientId', 'date_12MonthsAfter'],
                                  how = 'left', 
                                  suffixes=['', '_prevYear'])).drop(['date_12MonthsAfter',
                                                                     'date_3MonthsAfter',
                                                                     'date_12MonthsAfter_prevYear'], axis = 1)
data_merged_3Months = (data_df.merge(data_df[['clientId', 'date_3MonthsAfter', 'assets']], 
                                 left_on = ['clientId', 'date'], 
                                 right_on = ['clientId', 'date_3MonthsAfter'],
                                 how = 'left', 
                                 suffixes=['', '_prevQuarter'])).drop(['assets',
                                                                       'flow',
                                                                       'date_12MonthsAfter',
                                                                       'date_3MonthsAfter',
                                                                       'date_3MonthsAfter_prevQuarter'], axis = 1)
data_merged_assets = data_merged_12Months.merge(data_merged_3Months, on = ['clientId', 'date'])
data_merged_assets['perc yoy'] = (data_merged_assets['assets'] - data_merged_assets['assets_prevYear'])/data_merged_assets['assets_prevYear']
data_merged_assets['perc qoq'] = (data_merged_assets['assets'] - data_merged_assets['assets_prevQuarter'])/data_merged_assets['assets_prevQuarter']

For the flow calculation, you first have to replace the blank cells with zeros, which you can do with the .replace('', '0') method, and then convert the column data type to int. To calculate the sums over the last 12 and 3 months I found this solution:

# blank flow cells become '0' so the column can be cast to int
data_merged_assets['flow'] = data_merged_assets['flow'].replace('', '0').astype(int)
data_flow_12Months = data_merged_assets.groupby(['clientId']).rolling(on = 'date', window=12, min_periods=12)["flow"].sum().reset_index()
data_flow_3Months = data_merged_assets.groupby(['clientId']).rolling(on = 'date', window=3, min_periods=3)["flow"].sum().reset_index()
data_flow = data_flow_12Months.merge(data_flow_3Months, on = ['clientId', 'date'], suffixes = ['_sumLast12Months', '_sumLast3Months'])

Finally, I merged the two datasets.

data_merged = data_merged_assets.merge(data_flow, on = ['clientId', 'date'])