I have a large Pandas dataframe that looks as follows (85k rows):
df1 = pd.DataFrame({"ID": [1, 1,
2,
3, 3, 3],
"BEGDT": [pd.to_datetime("1986-01-01"), pd.to_datetime("1989-01-01"),
pd.to_datetime("1988-01-01"),
pd.to_datetime("1983-01-01"), pd.to_datetime("1986-01-01"), pd.to_datetime("1987-01-01")],
"ENDDT": [pd.to_datetime("1988-12-31"), pd.to_datetime("1989-12-31"),
pd.to_datetime("1990-12-31"),
pd.to_datetime("1985-12-31"), pd.to_datetime("1986-12-31"), pd.to_datetime("1990-12-31")],
"Inception": [pd.to_datetime("1984-12-04"), pd.to_datetime("1984-12-04"),
pd.to_datetime("1987-06-07"),
pd.to_datetime("1982-05-08"), pd.to_datetime("1982-05-08"), pd.to_datetime("1982-05-08")],
"NAME": ["Juan", "Jerome",
"Pedro",
"Javier", "Pastor", "Daniel"]})
My goal is the following: for the first observation of each ID for which BEGDT > Inception, copy that row, set its BEGDT to Inception, and set its ENDDT to the BEGDT of the originally copied row minus one day.
Accordingly, the final output should look as follows:
df2 = pd.DataFrame({"ID": [1, 1, 1,
2, 2,
3, 3, 3, 3],
"BEGDT": [pd.to_datetime("1984-12-04"), pd.to_datetime("1986-01-01"), pd.to_datetime("1989-01-01"),
pd.to_datetime("1987-06-07"), pd.to_datetime("1988-01-01"),
pd.to_datetime("1982-05-08"), pd.to_datetime("1983-01-01"), pd.to_datetime("1986-01-01"), pd.to_datetime("1987-01-01")],
"ENDDT": [pd.to_datetime("1985-12-31"), pd.to_datetime("1988-12-31"), pd.to_datetime("1989-12-31"),
pd.to_datetime("1987-12-31"), pd.to_datetime("1990-12-31"),
pd.to_datetime("1982-12-31"), pd.to_datetime("1985-12-31"), pd.to_datetime("1986-12-31"), pd.to_datetime("1990-12-31")],
"Inception": [pd.to_datetime("1984-12-04"), pd.to_datetime("1984-12-04"), pd.to_datetime("1984-12-04"),
pd.to_datetime("1987-06-07"), pd.to_datetime("1987-06-07"),
pd.to_datetime("1982-05-08"), pd.to_datetime("1982-05-08"), pd.to_datetime("1982-05-08"), pd.to_datetime("1982-05-08")],
"NAME": ["Juan", "Juan", "Jerome",
"Pedro", "Pedro",
"Javier", "Javier", "Pastor", "Daniel"]})
I assume that I first have to group the data with df1.groupby("ID").first(), then do the calculations, and finally insert the resulting rows back into df1. However, I am not sure whether this is the best way to do it.
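A rough, untested sketch of that idea (the re-insertion step at the end is the part I am least sure about):

# First row per ID where coverage starts after inception
firsts = df1[df1.BEGDT > df1.Inception].groupby("ID").first()

# The copied row should span Inception .. original BEGDT - 1 day
firsts["ENDDT"] = firsts["BEGDT"] - pd.Timedelta("1 day")
firsts["BEGDT"] = firsts["Inception"]

# Put the new rows back and restore the ordering
pd.concat([df1, firsts.reset_index()]).sort_values(["ID", "BEGDT"]).reset_index(drop=True)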
Any help would be appreciated.
CodePudding user response:
Editing the values can be done once on a copy of the whole dataframe (we'll call it tmp), which is faster than editing each group inside the groupby. We can then filter the original by BEGDT > Inception, apply groupby.first as you suggested, take the resulting index values, fetch those rows from our copy, and combine the two:
# Shift the dates on a full copy up front, instead of per group
tmp = df1.copy()
tmp['ENDDT'] = tmp.BEGDT - pd.Timedelta('1 day')
tmp['BEGDT'] = tmp.Inception

# Original index of the first row per ID where BEGDT > Inception
new_rows = (df1[df1.BEGDT > df1.Inception]
            .reset_index()
            .groupby('ID')
            .first()['index'])

# Append those rows from the shifted copy; sorting by ID and BEGDT puts the
# back-filled row first within each ID, matching the desired output
(pd.concat([df1, tmp.loc[new_rows]])
 .sort_values(['ID', 'BEGDT'])
 .reset_index(drop=True))
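For what it's worth, an equivalent way to grab the first qualifying row per ID is drop_duplicates on ID after filtering, which avoids the reset_index/groupby round trip. This is only a sketch against the df1 defined above, and it relies on the same assumption as before, namely that "first observation" means the first row in the frame's current order:

# First row of each ID whose coverage starts after inception
first_gap = df1[df1.BEGDT > df1.Inception].drop_duplicates('ID')

# Back-fill rows covering Inception .. original BEGDT - 1 day
filler = first_gap.assign(ENDDT=first_gap.BEGDT - pd.Timedelta('1 day'),
                          BEGDT=first_gap.Inception)

(pd.concat([df1, filler])
 .sort_values(['ID', 'BEGDT'])
 .reset_index(drop=True))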