I have a pandas dataframe, like this:
ID | date | status |
---|---|---|
10 | 2022-01-01 | 0 |
10 | 2022-01-02 | 0 |
10 | 2022-01-03 | 1 |
10 | 2022-01-04 | 1 |
10 | 2022-01-05 | 1 |
23 | 2022-02-02 | 0 |
23 | 2022-02-03 | 0 |
23 | 2022-02-04 | 1 |
23 | 2022-02-05 | 1 |
23 | 2022-02-06 | 1 |
I would like to group by ID and find the first date on which status equals 1.
Expected output:
ID | date | status | first_status |
---|---|---|---|
10 | 2022-01-03 | 1 | 2022-01-03 |
23 | 2022-02-04 | 1 | 2022-02-04 |
After this I will merge this new DF with the previous DF. Final DF:
ID | date | status | first_status |
---|---|---|---|
10 | 2022-01-01 | 0 | 2022-01-03 |
10 | 2022-01-02 | 0 | 2022-01-03 |
10 | 2022-01-03 | 1 | 2022-01-03 |
10 | 2022-01-04 | 1 | 2022-01-03 |
10 | 2022-01-05 | 1 | 2022-01-03 |
23 | 2022-02-02 | 0 | 2022-02-04 |
23 | 2022-02-03 | 0 | 2022-02-04 |
23 | 2022-02-04 | 1 | 2022-02-04 |
23 | 2022-02-05 | 1 | 2022-02-04 |
23 | 2022-02-06 | 1 | 2022-02-04 |
I tried many ways to do this, but without success.
CodePudding user response:
Get the first date for status=1 for each ID. Then map each ID to the first date:
# convert to datetime if needed
df["date"] = pd.to_datetime(df["date"])
df["first_status"] = df["ID"].map(df[df["status"].eq(1)].groupby("ID")["date"].min())
>>> df
ID date status first_status
0 10 2022-01-01 0 2022-01-03
1 10 2022-01-02 0 2022-01-03
2 10 2022-01-03 1 2022-01-03
3 10 2022-01-04 1 2022-01-03
4 10 2022-01-05 1 2022-01-03
5 23 2022-02-02 0 2022-02-04
6 23 2022-02-03 0 2022-02-04
7 23 2022-02-04 1 2022-02-04
8 23 2022-02-05 1 2022-02-04
9 23 2022-02-06 1 2022-02-04
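Putting the two lines above together with the sample data from the question gives a minimal runnable sketch of this approach:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [10] * 5 + [23] * 5,
    "date": ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05",
             "2022-02-02", "2022-02-03", "2022-02-04", "2022-02-05", "2022-02-06"],
    "status": [0, 0, 1, 1, 1, 0, 0, 1, 1, 1],
})

# convert to datetime so min() compares dates, not strings
df["date"] = pd.to_datetime(df["date"])

# earliest date with status == 1 per ID, mapped back onto every row of that ID
df["first_status"] = df["ID"].map(df[df["status"].eq(1)].groupby("ID")["date"].min())
```

The inner `groupby("ID")["date"].min()` produces a Series indexed by ID, which is exactly what `Series.map` accepts as a lookup table.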
CodePudding user response:
You can filter the status 1 rows, get the first (or min, depending on the use case) date per group, then merge to the original dataframe:
df2 = (df[df['status'].eq(1)]
.groupby('ID', as_index=False)
['date'].first() # could also use "min()"
.rename(columns={'date': 'first_status'})
)
df.merge(df2, on='ID')
output:
ID date status first_status
0 10 2022-01-01 0 2022-01-03
1 10 2022-01-02 0 2022-01-03
2 10 2022-01-03 1 2022-01-03
3 10 2022-01-04 1 2022-01-03
4 10 2022-01-05 1 2022-01-03
5 23 2022-02-02 0 2022-02-04
6 23 2022-02-03 0 2022-02-04
7 23 2022-02-04 1 2022-02-04
8 23 2022-02-05 1 2022-02-04
9 23 2022-02-06 1 2022-02-04
Intermediate df2:
ID first_status
0 10 2022-01-03
1 23 2022-02-04
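One edge case worth noting: `merge` defaults to an inner join, so any ID that never reaches status 1 would be dropped from the result. If you need to keep those rows (with a missing `first_status`), pass `how="left"`. A small sketch with a hypothetical ID 99 that never hits status 1:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [10, 10, 99],  # ID 99 never reaches status 1
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-03-01"]),
    "status": [0, 1, 0],
})

df2 = (df[df["status"].eq(1)]
       .groupby("ID", as_index=False)["date"].first()
       .rename(columns={"date": "first_status"}))

# how="left" keeps IDs with no status-1 row (first_status becomes NaT);
# the default inner merge would drop them entirely
out = df.merge(df2, on="ID", how="left")
```

The `map`-based answer above handles this case automatically, since unmapped IDs simply get NaN/NaT.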