I need to add some values to a dataframe based on the ID and DATE_TWO columns. In the case when DATE_TWO >= DATE_ONE then fill in any subsequent DATE_TWO values for that ID with the first DATE_TWO value. Here is the original dataframe:
ID | EVENT | DATE_ONE | DATE_TWO |
---|---|---|---|
1 | 13 | 3/1/2021 | |
1 | 20 | 3/5/2021 | 3/5/2021 |
1 | 32 | 3/6/2021 | |
1 | 43 | 3/7/2021 | |
2 | 1 | 3/3/2021 | |
2 | 2 | 4/5/2021 | |
3 | 1 | 3/1/2021 | |
3 | 12 | 3/7/2021 | 3/7/2021 |
3 | 13 | 3/9/2021 | |
3 | 15 | 3/14/2021 |
Here is what the table after transformation:
ID | EVENT | DATE_ONE | DATE_TWO |
---|---|---|---|
1 | 13 | 3/1/2021 | |
1 | 20 | 3/5/2021 | 3/5/2021 |
1 | 32 | 3/6/2021 | 3/5/2021 |
1 | 43 | 3/7/2021 | 3/5/2021 |
2 | 1 | 3/3/2021 | |
2 | 2 | 4/5/2021 | |
3 | 1 | 3/1/2021 | |
3 | 12 | 3/7/2021 | 3/7/2021 |
3 | 13 | 3/9/2021 | 3/7/2021 |
3 | 15 | 3/14/2021 | 3/7/2021 |
This could be done with a for loop, but I know in python - particularly with dataframes - for loops can be slow. Is there some other more python and computationally speedy way to accomplish what I am seeking?
data = {'ID': [1,1,1,1,2,2,3,3,3,3],
'EVENT': [12, 20, 32, 43,1,2,1,12,13,15],
'DATE_ONE': ['3/1/2021','3/5/2021','3/6/2021','3/7/2021','3/3/2021','4/5/2021',
'3/1/2021','3 /7/2021','3/9/2021','3/14/2021'],
'DATE_TWO': ['','3/5/2021','','','','','','3/7/2021','','']}
CodePudding user response:
I slightly changed your data so we can see how it works.
Data
import pandas as pd
import numpy as np
data = {'ID': [1,1,1,1,2,2,3,3,3,3],
'EVENT': [12, 20, 32, 43,1,2,1,12,13,15],
'DATE_ONE': ['3/1/2021','3/5/2021','3/6/2021','3/7/2021','3/3/2021','4/5/2021',
'3/1/2021','3 /7/2021','3/9/2021','3/14/2021'],
'DATE_TWO': ['','3/5/2021','','','','','3/7/2021','','3/7/2021','']}
df = pd.DataFrame(data)
df["DATE_ONE"] = pd.to_datetime(df["DATE_ONE"])
df["DATE_TWO"] = pd.to_datetime(df["DATE_TWO"])
# We better sort DATE_ONE
df = df.sort_values(["ID", "DATE_ONE"]).reset_index(drop=True)
FILL with condition
df["COND"] = np.where(df["DATE_ONE"].le(df["DATE_TWO"]).eq(True),
1,
np.where(df["DATE_TWO"].notnull() &
df["DATE_ONE"].gt(df["DATE_TWO"]),
0,
np.nan))
grp = df.groupby("ID")
df["COND"] = grp["COND"].fillna(method='ffill').fillna(0)
df["FILL"] = grp["DATE_TWO"].fillna(method='ffill')
df["DATE_TWO"] = np.where(df["COND"].eq(1), df["FILL"], df["DATE_TWO"])
df = df.drop(columns=["COND", "FILL"])
ID EVENT DATE_ONE DATE_TWO
0 1 12 2021-03-01 NaT
1 1 20 2021-03-05 2021-03-05
2 1 32 2021-03-06 2021-03-05
3 1 43 2021-03-07 2021-03-05
4 2 1 2021-03-03 NaT
5 2 2 2021-04-05 NaT
6 3 1 2021-03-01 2021-03-07
7 3 12 2021-03-07 2021-03-07
8 3 13 2021-03-09 2021-03-07
9 3 15 2021-03-14 NaT