I have the pandas dataframe,
data = pd.DataFrame([['1001','2020-03-06'],
['1001','2020-04-06'],
['1002','2021-04-02'],
['1003','2022-07-08'],
['1001','2020-09-06'],
['1003','2022-04-04'],
['1002','2021-06-05'],
['1007','2020-09-08'],
['1002','2021-12-07'],
['1003','2022-12-06'],
['1007','2020-02-10'],
],
columns=['Type', 'Date'])
I need to Group by each id and then apply the monthly difference in each id to get monthly frequency number
I tried out : Solution :
data['Date'] = pd.to_datetime(data['Date'])
data['diff'] = data.groupby(['Type'])['Date'].apply(lambda x:(x.max() - x)/np.timedelta64(1, 'M'))
data['diff'] = data['diff'].astype(int)
data = data.sort_values('Type')
I am expecting the Output dataframe :
Dataframe : ([['1007','2020-09-08', 0],
['1007','2020-02-10', 7],
['1003','2022-12-06', 0],
['1003','2022-07-08', 5],
['1003','2022-04-04', 3],
['1002','2021-12-07', 0],
['1002','2021-06-05', 6],
['1002','2021-04-02', 2],
['1001','2020-09-06', 0],
['1001','2020-04-06', 5],
['1001','2020-03-06', 1],
],
columns=['Type', 'Date', 'MonthlyFreq'])
CodePudding user response:
You can use period objects to calculate the number of monthly periods in between 2 dates:
data['Date'] = pd.to_datetime(data['Date'])
data = data.sort_values(['Type', 'Date'], ascending=False)
data['diff'] = (data['Date']
.dt.to_period('M') # convert to monthly period
.groupby(data['Type']).diff().mul(-1)
.apply(lambda x: 0 if pd.isna(x) else x.n)
)
output:
Type Date diff
7 1007 2020-09-08 0
10 1007 2020-02-10 7
9 1003 2022-12-06 0
3 1003 2022-07-08 5
5 1003 2022-04-04 3
8 1002 2021-12-07 0
6 1002 2021-06-05 6
2 1002 2021-04-02 2
4 1001 2020-09-06 0
1 1001 2020-04-06 5
0 1001 2020-03-06 1
CodePudding user response:
Why do you do x.max() - x
? Wouldn't that the difference to the max, not to the previous record? Try:
data = data.sort_values(['Type','Date'], ascending=False)
# use `round` instead of `astype(int)` if you want
data['diff'] = - data.groupby(['Type'])['Date'].diff().div(np.timedelta64(1, 'M')).fillna(0).astype(int)
Output:
Type Date diff
7 1007 2020-09-08 0
10 1007 2020-02-10 6
9 1003 2022-12-06 0
3 1003 2022-07-08 4
5 1003 2022-04-04 3
8 1002 2021-12-07 0
6 1002 2021-06-05 6
2 1002 2021-04-02 2
4 1001 2020-09-06 0
1 1001 2020-04-06 5
0 1001 2020-03-06 1
Another way is just to do some arithmetic on the year and month:
data['diff'] = (data['Date'].dt.year.mul(12)
.add(data['Date'].dt.month)
.groupby(data['Type']).diff()
.fillna(0).mul(-1).astype(int)
)
Output:
Type Date diff
7 1007 2020-09-08 0
10 1007 2020-02-10 7
9 1003 2022-12-06 0
3 1003 2022-07-08 5
5 1003 2022-04-04 3
8 1002 2021-12-07 0
6 1002 2021-06-05 6
2 1002 2021-04-02 2
4 1001 2020-09-06 0
1 1001 2020-04-06 5
0 1001 2020-03-06 1