I have a large Pandas dataframe that looks as follows (85k rows):
df1 = pd.DataFrame({"ID": [1, 1,
2,
3, 3, 3],
"BEGDT": [pd.to_datetime("1986-01-01"), pd.to_datetime("1989-01-01"),
pd.to_datetime("1988-01-01"),
pd.to_datetime("1983-01-01"), pd.to_datetime("1986-01-01"), pd.to_datetime("1987-01-01")],
"ENDDT": [pd.to_datetime("1988-12-31"), pd.to_datetime("1989-12-31"),
pd.to_datetime("1990-12-31"),
pd.to_datetime("1985-12-31"), pd.to_datetime("1986-12-31"), pd.to_datetime("1990-12-31")],
"Inception": [pd.to_datetime("1984-12-04"), pd.to_datetime("1984-12-04"),
pd.to_datetime("1987-06-07"),
pd.to_datetime("1982-05-08"), pd.to_datetime("1982-05-08"), pd.to_datetime("1982-05-08")],
"NAME": ["Juan", "Jerome",
"Pedro",
"Javier", "Pastor", "Daniel"]})
My goal is the following: for the first observation of each ID for which BEGDT > Inception, copy that row, set its BEGDT to Inception, and set its ENDDT to the BEGDT of the originally copied row minus one day.
Accordingly, the final output should look as follows:
df2 = pd.DataFrame({"ID": [1, 1, 1,
2, 2,
3, 3, 3, 3],
"BEGDT": [pd.to_datetime("1984-12-04"), pd.to_datetime("1986-01-01"), pd.to_datetime("1989-01-01"),
pd.to_datetime("1987-06-07"), pd.to_datetime("1988-01-01"),
pd.to_datetime("1982-05-08"), pd.to_datetime("1983-01-01"), pd.to_datetime("1986-01-01"), pd.to_datetime("1987-01-01")],
"ENDDT": [pd.to_datetime("1985-12-31"), pd.to_datetime("1988-12-31"), pd.to_datetime("1989-12-31"),
pd.to_datetime("1987-12-31"), pd.to_datetime("1990-12-31"),
pd.to_datetime("1982-12-31"), pd.to_datetime("1985-12-31"), pd.to_datetime("1986-12-31"), pd.to_datetime("1990-12-31")],
"Inception": [pd.to_datetime("1984-12-04"), pd.to_datetime("1984-12-04"), pd.to_datetime("1984-12-04"),
pd.to_datetime("1987-06-07"), pd.to_datetime("1987-06-07"),
pd.to_datetime("1982-05-08"), pd.to_datetime("1982-05-08"), pd.to_datetime("1982-05-08"), pd.to_datetime("1982-05-08")],
"NAME": ["Juan", "Juan", "Jerome",
"Pedro", "Pedro",
"Javier", "Javier", "Pastor", "Daniel"]})
I assume that I first have to group the data with df1.groupby("ID").first(), then do the calculations, and finally insert the resulting rows back into df1. However, I am not sure whether this is the best way to do it.
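A rough, untested sketch of that idea (the re-insertion step at the end is the part I am least sure about):

# First row per ID where coverage starts after inception
firsts = df1[df1.BEGDT > df1.Inception].groupby("ID").first()

# The copied row should span Inception .. original BEGDT - 1 day
firsts["ENDDT"] = firsts["BEGDT"] - pd.Timedelta("1 day")
firsts["BEGDT"] = firsts["Inception"]

# Put the new rows back and restore the ordering
pd.concat([df1, firsts.reset_index()]).sort_values(["ID", "BEGDT"]).reset_index(drop=True)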
Any help would be appreciated.
CodePudding user response:
Editing the values can be done once on a copy of the whole dataframe (we'll call it tmp), which is faster than editing each group inside the groupby. We can then filter the original by BEGDT > Inception, apply groupby.first as you suggested, take the resulting index values, fetch those rows from our copy, and combine the two:
# Shift the dates on a full copy up front, instead of per group
tmp = df1.copy()
tmp['ENDDT'] = tmp.BEGDT - pd.Timedelta('1 day')
tmp['BEGDT'] = tmp.Inception

# Original index of the first row per ID where BEGDT > Inception
new_rows = (df1[df1.BEGDT > df1.Inception]
            .reset_index()
            .groupby('ID')
            .first()['index'])

# Append those rows from the shifted copy; sorting by ID and BEGDT puts the
# back-filled row first within each ID, matching the desired output
(pd.concat([df1, tmp.loc[new_rows]])
 .sort_values(['ID', 'BEGDT'])
 .reset_index(drop=True))
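For what it's worth, an equivalent way to grab the first qualifying row per ID is drop_duplicates on ID after filtering, which avoids the reset_index/groupby round trip. This is only a sketch against the df1 defined above, and it relies on the same assumption as before, namely that "first observation" means the first row in the frame's current order:

# First row of each ID whose coverage starts after inception
first_gap = df1[df1.BEGDT > df1.Inception].drop_duplicates('ID')

# Back-fill rows covering Inception .. original BEGDT - 1 day
filler = first_gap.assign(ENDDT=first_gap.BEGDT - pd.Timedelta('1 day'),
                          BEGDT=first_gap.Inception)

(pd.concat([df1, filler])
 .sort_values(['ID', 'BEGDT'])
 .reset_index(drop=True))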