I have a dataframe e.g.
df=[[5,'2022-06',1,9],[0,'2022-06',2,3],[1,'2022-07',1,8],[45,'2022-06',2,8],[45,'2022-07',1,3],[0,'2022-07',1,16]]
df=pd.DataFrame(df1, columns = ['ID', 'date', 'Stage', 'c'])
Most people have two entries, one in June and one in July. Some have only one entry. I want to track how the variable c has changed from June to July, but only for people who changed from Stage 1 in June to Stage 2 in July.
I started by making tables of the people in Stage 1 in June and Stage 2 in July, then wanted to only take entries with overlap to get rid of people who stay in the same stage or are only in June or only in July:
juneStage1=df(df.Stage==1)&(df.date=='2022-06')]
julyStage2=df[(data.Stage==2)&(df.date=='2022-07')]
I thought merge should work. But then I lose a month:
shift=juneStage1.merge(julyStage2,on=["ID"])
The table only shows entries for June, and I lose the variable c. Also tried:
new_df_june = juneStage1.loc[juneStage1.ID.isin(julyStage2.ID)]
new_df_july= julyStage2.loc[julyStage2.ID.isin(juneStage1.ID)]
shift=(new_df_june.describe()-new_df_july.describe()).div(new_df_june.describe())
Then (I think!) that shows how much the mean of variable c has changed in % terms. I'm not sure this is doing what I expect though.
What I really want is for the merge to give me a table with ID, Stage, c in june, c in July so I can plot a histogram of c in June and of c in July and show that it has shifted. Is there an easy way to do this?
Thanks for any help! :)
CodePudding user response:
Given your data has two months only, you can pivot your table, then query from it:
pivot = df.pivot('ID','date')
shifted = pivot['Stage'].eq((2,1)).all(1)
out = pivot['c'][shifted].copy()
# track changes in the last month only
out['diff'] = out.diff(axis=1).iloc[:,-1]
out
Output:
date 2022-06 2022-07 diff
ID
0 3.0 16.0 13.0
45 8.0 3.0 -5.0
CodePudding user response:
Another possible solution:
pivot = df.pivot_table('c', 'ID', 'date')
diffed = pivot.diff(axis=1).iloc[:,-1].rename('diff')
out = pivot.merge(diffed.loc[diffed.notna()], on='ID')
Output:
2022-06 2022-07 diff
ID
0 3.0 16.0 13.0
45 8.0 3.0 -5.0