Edited/reposted with correct sample output.
I have a dataframe that looks like the following:
import pandas as pd

data = {
    "ID": [1, 1, 1, 2, 2, 2],
    "Year": [2021, 2021, 2023, 2015, 2017, 2018],
    "Combined": ['started', 'finished', 'started', 'started', 'finished', 'started'],
    "bool": [True, False, False, True, False, False],
    "Update": ['started', 'finished', 'started', 'started', 'finished', 'started']
}
df = pd.DataFrame(data)
print(df)
ID  Year  Combined   bool
 1  2021   started   True
 1  2021  finished  False
 1  2023   started  False
 2  2015   started   True
 2  2017  finished  False
 2  2018   started  False
This dataframe is split into groups by ID.
I would like to create an updated version of the Combined column, called Update. A row should be changed to 'finished' only if df['bool'] == True for that row AND there is another 'finished' row in the same group with a LATER (not the same) year.
Sample output:
ID  Year  Combined   bool    Update
 1  2021   started   True   started
 1  2021  finished  False  finished
 1  2023   started  False   started
 2  2015   started   True  finished
 2  2017  finished  False  finished
 2  2018   started  False   started
We are not updating the first group because there is no finished value in a LATER year, and we are updating the second group because there is a finished value in a later year. Thank you!
CodePudding user response:
One solution I can think of uses the groupby and apply methods. Each group is passed to the update function, which in turn applies update_group to every row of that group. update_group performs the tests and sets the "Update" value to "finished" when the conditions are met; otherwise it copies the "Combined" value. The returned DataFrame is the expected one.
Taking your example and omitting the "Update" column, which will be created in the second part:
import pandas as pd

data = {
    "ID": [1, 1, 1, 2, 2, 2],
    "Year": [2021, 2021, 2023, 2015, 2017, 2018],
    "Combined": ["started", "finished", "started", "started", "finished", "started"],
    "bool": [True, False, False, True, False, False],
}
df = pd.DataFrame(data)
print(df)
I obtain the following input DataFrame:
   ID  Year  Combined   bool
0   1  2021   started   True
1   1  2021  finished  False
2   1  2023   started  False
3   2  2015   started   True
4   2  2017  finished  False
5   2  2018   started  False
These are the two functions I use to update the DataFrame:
def update_group(row, group):
    """Update each row of a group"""
    # A truthiness check also works when the value is a numpy.bool_
    if row["bool"]:
        # Extract the entries from later years
        group_later = group[group.Year > row.Year]
        # If 'finished' is found, set the Update column to 'finished'
        if (group_later.Combined == "finished").any():
            row["Update"] = "finished"
        else:
            row["Update"] = row["Combined"]
    else:
        row["Update"] = row["Combined"]
    return row


def update(group):
    """Apply the update to each group"""
    return group.apply(update_group, group=group, axis=1)
So if you apply these functions to your DataFrame:
# group_keys=False keeps the original flat index instead of adding an ID level
df = df.groupby("ID", group_keys=False).apply(update)
print(df)
the returned DataFrame is:
   ID  Year  Combined   bool    Update
0   1  2021   started   True   started
1   1  2021  finished  False  finished
2   1  2023   started  False   started
3   2  2015   started   True  finished
4   2  2017  finished  False  finished
5   2  2018   started  False   started
CodePudding user response:
This uses temporary variables and avoids the apply path, which is generally slow:
import numpy as np

# identify the 'started' rows that have a True value
start_true = df.Combined.eq('started') & df['bool']
# identify rows where Combined is 'finished'
condition = df.Combined.eq('finished')
# for every row, look up the Year and ID of the next 'finished' row
year_shift = df.Year.where(condition).bfill()
id_shift = df.ID.where(condition).bfill()
# a flagged 'started' row followed by a later 'finished' row in the same ID
condition = start_true & df.ID.eq(id_shift) & df.Year.lt(year_shift)
# if it matches, 'finished', else just return what is in the Combined column
update = np.where(condition, 'finished', df.Combined)
df.assign(Update=update)
   ID  Year  Combined   bool    Update
0   1  2021   started   True   started
1   1  2021  finished  False  finished
2   1  2023   started  False   started
3   2  2015   started   True  finished
4   2  2017  finished  False  finished
5   2  2018   started  False   started
This solution assumes that the data is sorted by ID and Year in ascending order.
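If sorting cannot be guaranteed, one possible variant is to compare each row's Year against the latest 'finished' year within its own ID group. This is only a sketch under that reading of the question: the names last_finished and mask are illustrative and do not come from the answers above. Because bfill looks only at the next 'finished' row, this variant may also behave differently when a group has several 'finished' rows.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Year": [2021, 2021, 2023, 2015, 2017, 2018],
    "Combined": ["started", "finished", "started", "started", "finished", "started"],
    "bool": [True, False, False, True, False, False],
})

# Year of the latest 'finished' row per ID, broadcast back to every row
# (NaN for groups that have no 'finished' row at all)
finished_year = df["Year"].where(df["Combined"].eq("finished"))
last_finished = finished_year.groupby(df["ID"]).transform("max")

# Flagged rows with a strictly later 'finished' year in the same group;
# comparisons against NaN evaluate to False, so such rows are left alone
mask = df["bool"] & df["Year"].lt(last_finished)

# Replace Combined with 'finished' where the mask holds, keep it otherwise
df["Update"] = df["Combined"].mask(mask, "finished")
print(df)
```

On the sample data this gives the same Update column as the sample output in the question, and it does not depend on the row order.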