Is there a better way of doing the below manipulation?
df = pd.DataFrame(data={
    "id": ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
    "date": ["2020-01-01", "2020-01-11", "2020-01-10", "2020-01-06", "2020-02-01", "2020-02-22", "2020-02-20"],
    "type": ["start", "start", "closing", "start", "start", "closing", "closing"],
})
for id in df['id'].unique():
    closing_date = df.loc[(df['id'] == id) & (df['type'] == 'closing'), 'date'].min()
    df.loc[(df['id'] == id) & (df['date'] > closing_date), 'date'] = closing_date
CodePudding user response:
I think you might find a nice middle ground between the fully procedural (but fairly readable) approach you started with and the more fully pandas (but less readable, imho) approaches in the other answers: pop your logic out as a function that operates on the sub-dataframe for a single "id", and then simply apply it:
def make_start_precede_close(subdf):
    closing_date = subdf.loc[subdf['type'] == 'closing', 'date'].min()
    subdf.loc[subdf['date'] > closing_date, 'date'] = closing_date
    return subdf

df = df.groupby("id", group_keys=False).apply(make_start_precede_close)
Ultimately, you'll have to make a decision about the tradeoffs between testability, reusability, readability and performance. Pure pandas is likely to win on performance (though not necessarily), but popping out an explicit function wins in the other categories, in my experience. Note that apply returns a new dataframe rather than modifying in place, so you need to assign the result; group_keys=False keeps the original flat index instead of adding an "id" level.
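Run end-to-end on the question's sample data, the approach looks like this (a minimal sketch; the function name is just the one from above):

```python
import pandas as pd

df = pd.DataFrame(data={
    "id": ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
    "date": ["2020-01-01", "2020-01-11", "2020-01-10",
             "2020-01-06", "2020-02-01", "2020-02-22", "2020-02-20"],
    "type": ["start", "start", "closing", "start", "start", "closing", "closing"],
})

def make_start_precede_close(subdf):
    # clamp every date in this id's group to the group's earliest closing date
    closing_date = subdf.loc[subdf['type'] == 'closing', 'date'].min()
    subdf.loc[subdf['date'] > closing_date, 'date'] = closing_date
    return subdf

# group_keys=False preserves the original flat index on the result
df = df.groupby("id", group_keys=False).apply(make_start_precede_close)
```

The "2020-01-11" row for id "a" becomes "2020-01-10", and the "2020-02-22" row for id "b" becomes "2020-02-20".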
CodePudding user response:
Avoid doing things row by row (or id by id). First I group by id and find the minimum date. Then I merge this data with the original dataframe to create a closing_date column, which has a value for every row. Now we only need to take the minimum of the original date column and closing_date. Finally, we remove the closing_date column.
closing_date = df[df['type'] == 'closing'].groupby('id').date.min()
df['closing_date'] = df.merge(closing_date, on='id').date_y
df['date'] = df[['date', 'closing_date']].min(axis=1)
df = df.drop(columns='closing_date')
Output:
id date type
0 a 2020-01-01 start
1 a 2020-01-10 start
2 a 2020-01-10 closing
3 b 2020-01-06 start
4 b 2020-02-01 start
5 b 2020-02-20 start
6 b 2020-02-20 closing
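The .date_y assignment above relies on the merged frame sharing the original's default RangeIndex; a variant of the same idea using Series.map sidesteps that alignment assumption entirely (a sketch on the question's sample data):

```python
import pandas as pd

df = pd.DataFrame(data={
    "id": ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
    "date": ["2020-01-01", "2020-01-11", "2020-01-10",
             "2020-01-06", "2020-02-01", "2020-02-22", "2020-02-20"],
    "type": ["start", "start", "closing", "start", "start", "closing", "closing"],
})

# per-id earliest closing date, as a Series indexed by id
closing_date = df.loc[df['type'] == 'closing'].groupby('id')['date'].min()

# Series.map broadcasts each id's value onto every row by label,
# so no intermediate merge or temporary column is needed
df['date'] = pd.concat([df['date'], df['id'].map(closing_date)], axis=1).min(axis=1)
```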
CodePudding user response:
You can use groupby.transform to get the closing date per group, then replace the values that are greater with boolean indexing:
closing = (df['date'].where(df['type'].eq('closing'))
             .groupby(df['id']).transform('last')
           )
df.loc[df['date'].gt(closing), 'date'] = closing
Output:
id date type
0 a 2020-01-01 start
1 a 2020-01-10 start
2 a 2020-01-10 closing
3 b 2020-01-06 start
4 b 2020-02-01 start
5 b 2020-02-20 start
6 b 2020-02-20 closing
If you have several "closing" per id and want to ensure getting the most recent one:
closing = (df['date'].sort_values()
             .where(df['type'].eq('closing'))
             .groupby(df['id']).transform('last')
           )
df.loc[df['date'].gt(closing), 'date'] = closing
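For reference, the first snippet runs end-to-end on the question's sample data as below. One caveat that applies to every answer here: the dates are still plain strings, and the comparisons only work because zero-padded ISO-8601 strings sort the same way as the dates they represent; converting with pd.to_datetime first would be safer in general.

```python
import pandas as pd

df = pd.DataFrame(data={
    "id": ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
    "date": ["2020-01-01", "2020-01-11", "2020-01-10",
             "2020-01-06", "2020-02-01", "2020-02-22", "2020-02-20"],
    "type": ["start", "start", "closing", "start", "start", "closing", "closing"],
})

# mask non-closing rows to NaN, then broadcast each id's last closing
# date back onto every row of that id ('last' skips the NaN entries)
closing = (df['date'].where(df['type'].eq('closing'))
             .groupby(df['id']).transform('last'))

# clamp any date past the closing date down to it
df.loc[df['date'].gt(closing), 'date'] = closing
```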