Home > Blockchain >  Track changes in variables between two months
Track changes in variables between two months

Time:09-04

I have a dataframe e.g.

df=[[5,'2022-06',1,9],[0,'2022-06',2,3],[1,'2022-07',1,8],[45,'2022-06',2,8],[45,'2022-07',1,3],[0,'2022-07',1,16]]
df=pd.DataFrame(df1, columns = ['ID', 'date', 'Stage', 'c'])

Most people have two entries, one in June and one in July. Some have only one entry. I want to track how the variable c has changed from June to July, but only for people who changed from Stage 1 in June to Stage 2 in July.

I started by making tables of the people in Stage 1 in June and Stage 2 in July, then wanted to only take entries with overlap to get rid of people who stay in the same stage or are only in June or only in July:

juneStage1=df(df.Stage==1)&(df.date=='2022-06')]
julyStage2=df[(data.Stage==2)&(df.date=='2022-07')]

I thought merge should work. But then I lose a month:

shift=juneStage1.merge(julyStage2,on=["ID"])

The table only shows entries for June, and I lose the variable c. Also tried:

new_df_june = juneStage1.loc[juneStage1.ID.isin(julyStage2.ID)]
new_df_july= julyStage2.loc[julyStage2.ID.isin(juneStage1.ID)]
shift=(new_df_june.describe()-new_df_july.describe()).div(new_df_june.describe())

Then (I think!) that shows how much the mean of variable c has changed in % terms. I'm not sure this is doing what I expect though.

What I really want is for the merge to give me a table with ID, Stage, c in june, c in July so I can plot a histogram of c in June and of c in July and show that it has shifted. Is there an easy way to do this?

Thanks for any help! :)

CodePudding user response:

Given your data has two months only, you can pivot your table, then query from it:

pivot = df.pivot('ID','date')

shifted = pivot['Stage'].eq((2,1)).all(1)

out = pivot['c'][shifted].copy()

# track changes in the last month only
out['diff'] = out.diff(axis=1).iloc[:,-1]
out

Output:

date  2022-06  2022-07  diff
ID                          
0         3.0     16.0  13.0
45        8.0      3.0  -5.0

CodePudding user response:

Another possible solution:

pivot = df.pivot_table('c', 'ID', 'date')
diffed = pivot.diff(axis=1).iloc[:,-1].rename('diff')
out = pivot.merge(diffed.loc[diffed.notna()], on='ID')

Output:

    2022-06  2022-07  diff
ID                        
0       3.0     16.0  13.0
45      8.0      3.0  -5.0
  • Related