I have a pandas dataframe like this:
user_id | code | earliest_date |
---|---|---|
A | N18.1 | 2016-04-08 |
A | N18.2 | 2017-06-08 |
A | N18.3 | 2018-04-08 |
B | N18.1 | 2016-06-16 |
B | N18.2 | 2016-08-16 |
C | N18.1 | 2021-01-11 |
In the code columns the values are always form a specified list, here [N18.1,N18.2,N18.3] etc.
I want a dataframe that displays the days it takes to progress from code N18.x to the next one in sequence for each user. I am thinking something like this:
user_id | N18.1 -> N18.2 | N18.2 -> N18.3 |
---|---|---|
A | 426 | 304 |
B | 60 | n/a |
C | n/a | n/a |
Your help will be much appreciated.
CodePudding user response:
Use DataFrameGroupBy.diff
for differecies per groups, create helper column for shifted code
and pass to DataFrame.pivot
:
df['earliest_date'] = pd.to_datetime(df['earliest_date'])
df['diff'] = df.groupby('user_id')['earliest_date'].diff().dt.days
df['code1'] = df.groupby('user_id')['code'].shift() '->' df['code']
df = df.pivot('user_id','code1','diff').drop(np.nan, axis=1)
print (df)
code1 N18.1->N18.2 N18.2->N18.3
user_id
A 426.0 304.0
B 61.0 NaN
C NaN NaN