I've got data like this:
user_id order_date
393 15/03/19
393 16/03/19
393 23/03/19
393 24/03/19
393 25/03/19
393 28/03/19
393 29/03/19
393 30/03/19
393 31/03/19
393 05/04/19
1014 08/12/18
1014 09/12/18
1014 18/12/18
1014 20/01/19
1014 22/03/19
1014 23/03/19
1014 30/07/19
What I want to achieve is the following:
user_id order_date month_no
393 15/03/19 1
393 16/03/19 1
393 23/03/19 1
393 24/03/19 1
393 25/03/19 1
393 28/03/19 1
393 29/03/19 1
393 30/03/19 1
393 31/03/19 1
393 05/04/19 2
1014 08/12/18 1
1014 09/12/18 1
1014 18/12/18 1
1014 20/01/19 2
1014 22/03/19 3
1014 23/03/19 3
1014 30/07/19 4
What I was trying to do was:
df["month_no"] = ((df.order_date- df.groupby('user_id')['order_date'].transform('first')) // np.timedelta64(1, 'M')) 1
But it is not working as I intend. I think it's due to the fact that the first order data is in mid-March?
CodePudding user response:
You can use groupby.rank
on the monthly period:
df['month_no'] = (pd.to_datetime(df['order_date'], dayfirst=True)
.dt.to_period('M')
.groupby(df['user_id'])
.rank('dense')
.convert_dtypes()
)
Output:
user_id order_date month_no
0 393 15/03/19 1
1 393 16/03/19 1
2 393 23/03/19 1
3 393 24/03/19 1
4 393 25/03/19 1
5 393 28/03/19 1
6 393 29/03/19 1
7 393 30/03/19 1
8 393 31/03/19 1
9 393 05/04/19 2
10 1014 08/12/18 1
11 1014 09/12/18 1
12 1014 18/12/18 1
13 1014 20/01/19 2
14 1014 22/03/19 3
15 1014 23/03/19 3
16 1014 30/07/19 4