I have a dataframe representing all changes that have been made to a record over time. Among other things, this dataframe contains a record id (in this case not unique and not meant to be as it tracks multiple changes to the same record on a different table), startdate and enddate. Enddate is only included if it is know/preset, often it is not. I would like to map the enddate of each change record to the startdate of the next record in the dataframe with the same id.
>>> thing = pd.DataFrame([
... {'id':1,'startdate':date(2021,1,1),'enddate':date(2022,1,1)},
... {'id':1,'startdate':date(2021,3,24),'enddate':None},
... {'id':1,'startdate':date(2021,5,26),'enddate':None},
... {'id':2,'startdate':date(2021,2,2),'enddate':None},
... {'id':2,'startdate':date(2021,11,26),'enddate':None}
... ])
>>> thing
id startdate enddate
0 1 2021-01-01 2022-01-01
1 1 2021-03-24 None
2 1 2021-05-26 None
3 2 2021-02-02 None
4 2 2021-11-26 None
The dataframe is already sorted by the creation timestamp of the record and the id. I tried this:
thing['enddate'] = thing.groupby('id')['startdate'].apply(lambda x: x.shift())
However the above code only maps this to around 10,000 of my 120,000 rows, the majority of which would have an enddate if I were to do this comparison by hand. Can anyone think of a better way to perform this kind of manipulation? For reference, give the dataframe above I'd like to create this one:
>>> thing
id startdate enddate
0 1 2021-01-01 2021-03-24
1 1 2021-03-24 2021-05-26
2 1 2021-05-26 None
3 2 2021-02-02 2021-11-26
4 2 2021-11-26 None
The idea is that once this transformation is done, I'll have a timeframe between which the configurations stored in the other columns (not impportant for this) were in place
CodePudding user response:
here is one way to do it
use transform with the groupby to assign back the values to the rows comprising the group
df['enddate']=df.groupby(['id'])['startdate'].transform(lambda x: x.shift(-1))
df
id startdate enddate
0 1 2021-01-01 2021-03-24
1 1 2021-03-24 2021-05-26
2 1 2021-05-26 NaT
3 2 2021-02-02 2021-11-26
4 2 2021-11-26 NaT