Home > Net >  How to calculate the Successive Month difference with Groupby in pandas
How to calculate the Successive Month difference with Groupby in pandas

Time:09-08

I have the pandas dataframe,

data = pd.DataFrame([['1001','2020-03-06'],
                     ['1001','2020-04-06'],
                     ['1002','2021-04-02'],
                     ['1003','2022-07-08'],
                     ['1001','2020-09-06'],
                     ['1003','2022-04-04'],
                     ['1002','2021-06-05'],
                     ['1007','2020-09-08'],
                     ['1002','2021-12-07'],
                     ['1003','2022-12-06'],
                     ['1007','2020-02-10'],
                     ], 
                    columns=['Type', 'Date'])

I need to Group by each id and then apply the monthly difference in each id to get monthly frequency number

I tried out : Solution :

data['Date'] = pd.to_datetime(data['Date'])
data['diff'] = data.groupby(['Type'])['Date'].apply(lambda x:(x.max() - x)/np.timedelta64(1, 'M'))
data['diff'] = data['diff'].astype(int) 
data = data.sort_values('Type')

I am expecting the Output dataframe :

Dataframe :         ([['1007','2020-09-08', 0],
                     ['1007','2020-02-10', 7],
                     ['1003','2022-12-06', 0],
                     ['1003','2022-07-08', 5],
                     ['1003','2022-04-04', 3],
                     ['1002','2021-12-07', 0],
                     ['1002','2021-06-05', 6],
                     ['1002','2021-04-02', 2],
                     ['1001','2020-09-06', 0],
                     ['1001','2020-04-06', 5],
                     ['1001','2020-03-06', 1],
                     ], 
                    columns=['Type', 'Date', 'MonthlyFreq'])

CodePudding user response:

You can use period objects to calculate the number of monthly periods in between 2 dates:

data['Date'] = pd.to_datetime(data['Date'])
data = data.sort_values(['Type', 'Date'], ascending=False)

data['diff'] = (data['Date']
 .dt.to_period('M') # convert to monthly period
 .groupby(data['Type']).diff().mul(-1)
 .apply(lambda x: 0 if pd.isna(x) else x.n)      
)

output:

    Type       Date  diff
7   1007 2020-09-08     0
10  1007 2020-02-10     7
9   1003 2022-12-06     0
3   1003 2022-07-08     5
5   1003 2022-04-04     3
8   1002 2021-12-07     0
6   1002 2021-06-05     6
2   1002 2021-04-02     2
4   1001 2020-09-06     0
1   1001 2020-04-06     5
0   1001 2020-03-06     1

CodePudding user response:

Why do you do x.max() - x? Wouldn't that the difference to the max, not to the previous record? Try:

data = data.sort_values(['Type','Date'], ascending=False)

# use `round` instead of `astype(int)` if you want
data['diff'] = - data.groupby(['Type'])['Date'].diff().div(np.timedelta64(1, 'M')).fillna(0).astype(int)

Output:

    Type       Date  diff
7   1007 2020-09-08     0
10  1007 2020-02-10     6
9   1003 2022-12-06     0
3   1003 2022-07-08     4
5   1003 2022-04-04     3
8   1002 2021-12-07     0
6   1002 2021-06-05     6
2   1002 2021-04-02     2
4   1001 2020-09-06     0
1   1001 2020-04-06     5
0   1001 2020-03-06     1

Another way is just to do some arithmetic on the year and month:

data['diff'] = (data['Date'].dt.year.mul(12)
                .add(data['Date'].dt.month)
                .groupby(data['Type']).diff()
                .fillna(0).mul(-1).astype(int)
               )

Output:

    Type       Date  diff
7   1007 2020-09-08     0
10  1007 2020-02-10     7
9   1003 2022-12-06     0
3   1003 2022-07-08     5
5   1003 2022-04-04     3
8   1002 2021-12-07     0
6   1002 2021-06-05     6
2   1002 2021-04-02     2
4   1001 2020-09-06     0
1   1001 2020-04-06     5
0   1001 2020-03-06     1
  • Related