Update column based on grouped date values-CodePudding

Edited/reposted with correct sample output.

I have a dataframe that looks like the following:


data = {
    "ID": [1, 1, 1, 2, 2, 2],
  "Year": [2021, 2021, 2023, 2015, 2017, 2018],
  "Combined": ['started', 'finished', 'started', 'started', 'finished', 'started'],
  "bool": [True, False, False, True, False, False],
"Update": ['started', 'finished', 'started', 'started', 'finished', 'started']

}

df = pd.DataFrame(data)

print(df) 

ID  Year  Combined   bool  
1  2021   started   True   
1  2021  finished  False  
1  2023   started  False   
2  2015   started   True   
2  2017  finished  False  
2  2018   started  False

This dataframe is split into groups by ID.

I would like to make an updated combined column based on if df['bool'] == True, but only if df['bool'] == True AND there is another 'finished' row in the same group with a LATER (not the same) year.

Sample output:

ID  Year  Combined   bool    Update
1  2021   started   True   started
1  2021  finished  False  finished
1  2023   started  False   started
2  2015   started   True   finished
2  2017  finished  False  finished
2  2018   started  False   started

We are not updating the first group because there is not a finished value in a LATER year, and we are updating the second group because there is a finished value in a later year. Thank you!

CodePudding user response：

One solution I could think of is to use the apply and the groupby methods. The values of each group are passed to the update_group function via the update function. This allows performing the tests and return an updated "Update" column when the conditions are met. Then, the returned DataFrame is the one expected.

If I take over your example, omitting the "Update" column that will be created in the second part:

import pandas as pd

data = {
    "ID": [1, 1, 1, 2, 2, 2],
    "Year": [2021, 2021, 2023, 2015, 2017, 2018],
    "Combined": ["started", 'finished', 'started', 'started', 'finished', 'started'],
    "bool": [True, False, False, True, False, False],
}

df = pd.DataFrame(data)
print(df)

I obtain the following input DataFrame:

   ID  Year  Combined   bool
0   1  2021   started   True
1   1  2021  finished  False
2   1  2023   started  False
3   2  2015   started   True
4   2  2017  finished  False
5   2  2018   started  False

These are the two functions I use to update the DataFrame:

def update_group(row, group):
    """Update each row of a group"""
    if row["bool"] is True:
        # Extract the later years entries
        group_later = group[group.Year > row.Year]

        # If finished in found, then turn the Update column to finished
        if any(group_later.Combined == "finished"):
            row["Update"] = "finished"
        else:

            row["Update"] = row["Combined"]
    else:
        row["Update"] = row["Combined"]

    return row

def update(group):
    """Apply the update to each group"""
    return group.apply(update_group, group=group, axis=1)

So if you apply these functions to your DataFrame:

df = df.groupby("ID").apply(update)
print(df)

the returned DataFrame is:

   ID  Year  Combined   bool    Update
0   1  2021   started   True   started
1   1  2021  finished  False  finished
2   1  2023   started  False   started
3   2  2015   started   True  finished
4   2  2017  finished  False  finished
5   2  2018   started  False   started

CodePudding user response：

This uses temporary columns, and avoids the apply path which can be generally slow:

# identify the start rows that have a True value
start_true = df.Combined.eq('started') & df['bool']

# identify rows where Combined is finished
condition = df.Combined.eq('finished')

# create more temporary variables
year_shift = df.Year.where(condition).bfill()
id_shift = df.ID.where(condition).bfill()
condition = df.ID.eq(id_shift) & df.Year.lt(year_shift)

# if it matches, 'finished', else just return what is in the Combined column
update = np.where(condition, 'finished', df.Combined)
df.assign(Update = update)

   ID  Year  Combined   bool    Update
0   1  2021   started   True   started
1   1  2021  finished  False  finished
2   1  2023   started  False   started
3   2  2015   started   True  finished
4   2  2017  finished  False  finished
5   2  2018   started  False   started

This solution assumes that the data is sorted on ID and Year in ascending order