Home > Software engineering >  Fill in dataframe values based on group criteria without for loop?
Fill in dataframe values based on group criteria without for loop?

Time:09-25

I need to add some values to a dataframe based on the ID and DATE_TWO columns. In the case when DATE_TWO >= DATE_ONE then fill in any subsequent DATE_TWO values for that ID with the first DATE_TWO value. Here is the original dataframe:

ID EVENT DATE_ONE DATE_TWO
1 13 3/1/2021
1 20 3/5/2021 3/5/2021
1 32 3/6/2021
1 43 3/7/2021
2 1 3/3/2021
2 2 4/5/2021
3 1 3/1/2021
3 12 3/7/2021 3/7/2021
3 13 3/9/2021
3 15 3/14/2021

Here is what the table after transformation:

ID EVENT DATE_ONE DATE_TWO
1 13 3/1/2021
1 20 3/5/2021 3/5/2021
1 32 3/6/2021 3/5/2021
1 43 3/7/2021 3/5/2021
2 1 3/3/2021
2 2 4/5/2021
3 1 3/1/2021
3 12 3/7/2021 3/7/2021
3 13 3/9/2021 3/7/2021
3 15 3/14/2021 3/7/2021

This could be done with a for loop, but I know in python - particularly with dataframes - for loops can be slow. Is there some other more python and computationally speedy way to accomplish what I am seeking?

data = {'ID': [1,1,1,1,2,2,3,3,3,3],
    'EVENT': [12, 20, 32, 43,1,2,1,12,13,15],
    'DATE_ONE': ['3/1/2021','3/5/2021','3/6/2021','3/7/2021','3/3/2021','4/5/2021',
                 '3/1/2021','3 /7/2021','3/9/2021','3/14/2021'],
    'DATE_TWO': ['','3/5/2021','','','','','','3/7/2021','','']}

CodePudding user response:

I slightly changed your data so we can see how it works.

Data

import pandas as pd
import numpy as np


data = {'ID': [1,1,1,1,2,2,3,3,3,3],
    'EVENT': [12, 20, 32, 43,1,2,1,12,13,15],
    'DATE_ONE': ['3/1/2021','3/5/2021','3/6/2021','3/7/2021','3/3/2021','4/5/2021',
                 '3/1/2021','3 /7/2021','3/9/2021','3/14/2021'],
    'DATE_TWO': ['','3/5/2021','','','','','3/7/2021','','3/7/2021','']}

df = pd.DataFrame(data)

df["DATE_ONE"] = pd.to_datetime(df["DATE_ONE"])
df["DATE_TWO"] = pd.to_datetime(df["DATE_TWO"])

# We better sort DATE_ONE
df = df.sort_values(["ID", "DATE_ONE"]).reset_index(drop=True)

FILL with condition

df["COND"] = np.where(df["DATE_ONE"].le(df["DATE_TWO"]).eq(True),
                      1, 
                     np.where(df["DATE_TWO"].notnull() & 
                              df["DATE_ONE"].gt(df["DATE_TWO"]),
                              0,
                              np.nan))

grp = df.groupby("ID")

df["COND"] = grp["COND"].fillna(method='ffill').fillna(0)

df["FILL"] = grp["DATE_TWO"].fillna(method='ffill')

df["DATE_TWO"] = np.where(df["COND"].eq(1), df["FILL"], df["DATE_TWO"])

df = df.drop(columns=["COND", "FILL"])
   ID  EVENT   DATE_ONE   DATE_TWO
0   1     12 2021-03-01        NaT
1   1     20 2021-03-05 2021-03-05
2   1     32 2021-03-06 2021-03-05
3   1     43 2021-03-07 2021-03-05
4   2      1 2021-03-03        NaT
5   2      2 2021-04-05        NaT
6   3      1 2021-03-01 2021-03-07
7   3     12 2021-03-07 2021-03-07
8   3     13 2021-03-09 2021-03-07
9   3     15 2021-03-14        NaT
  • Related