Home > Software design >  Groupby first by a conditional value
Groupby first by a conditional value

Time:04-29

I have a pandas dataframe, like this:

ID date status
10 2022-01-01 0
10 2022-01-02 0
10 2022-01-03 1
10 2022-01-04 1
10 2022-01-05 1
23 2022-02-02 0
23 2022-02-03 0
23 2022-02-04 1
23 2022-02-05 1
23 2022-02-06 1

I would like to group per ID and the first date on status is equal 1.

Expected output:

ID date status first_status
10 2022-01-03 1 2022-01-03
23 2022-02-03 1 2022-02-03

afteer this I will merge this new DF with previous DF. Final DF:

ID date status first_status
10 2022-01-01 0 2022-01-03
10 2022-01-02 0 2022-01-03
10 2022-01-03 1 2022-01-03
10 2022-01-04 1 2022-01-03
10 2022-01-05 1 2022-01-03
23 2022-02-02 0 2022-02-04
23 2022-02-03 0 2022-02-04
23 2022-02-04 1 2022-02-04
23 2022-02-05 1 2022-02-04
23 2022-02-06 1 2022-02-04

I tried many ways to do this, but unsuccessful

CodePudding user response:

Get the first date for status=1 for each ID. Then map each ID to the first date:

#convert to datetime if needed
df["date"] = pd.to_datetime(df["date"])

df["first_status"] = df["ID"].map(df[df["status"].eq(1)].groupby("ID")["date"].min())

>>> df
   ID       date  status first_status
0  10 2022-01-01       0   2022-01-03
1  10 2022-01-02       0   2022-01-03
2  10 2022-01-03       1   2022-01-03
3  10 2022-01-04       1   2022-01-03
4  10 2022-01-05       1   2022-01-03
5  23 2022-02-02       0   2022-02-04
6  23 2022-02-03       0   2022-02-04
7  23 2022-02-04       1   2022-02-04
8  23 2022-02-05       1   2022-02-04
9  23 2022-02-06       1   2022-02-04

CodePudding user response:

You can filter the status 1 rows, and get the first (or min depending on the use case) per group, then merge to the orginal dataframe:

df2 = (df[df['status'].eq(1)]
       .groupby('ID', as_index=False)
       ['date'].first() # could also use "min()"
       .rename(columns={'date': 'first_status'})
      )

df.merge(df2, on='ID')

output:

   ID        date  status first_status
0  10  2022-01-01       0   2022-01-03
1  10  2022-01-02       0   2022-01-03
2  10  2022-01-03       1   2022-01-03
3  10  2022-01-04       1   2022-01-03
4  10  2022-01-05       1   2022-01-03
5  23  2022-02-02       0   2022-02-04
6  23  2022-02-03       0   2022-02-04
7  23  2022-02-04       1   2022-02-04
8  23  2022-02-05       1   2022-02-04
9  23  2022-02-06       1   2022-02-04

intermediate df2:

   ID first_status
0  10   2022-01-03
1  23   2022-02-04
  • Related