How to replace a column value with another row's column value based on a condition using pandas?


Is there a better way of doing the below manipulation?

import pandas as pd

df = pd.DataFrame(data={
    "id": ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
    "date": ["2020-01-01", "2020-01-11", "2020-01-10", "2020-01-06", "2020-02-01", "2020-02-22", "2020-02-20"],
    "type": ["start", "start", "closing", "start", "start", "start", "closing"],
})

# For each id, cap every date at that id's earliest "closing" date
for id_ in df['id'].unique():
    closing_date = df.loc[(df['id'] == id_) & (df['type'] == 'closing'), 'date'].min()
    df.loc[(df['id'] == id_) & (df['date'] > closing_date), 'date'] = closing_date
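
For reference, running this loop on the sample frame caps each date at its id's earliest closing date:

  id        date     type
0  a  2020-01-01    start
1  a  2020-01-10    start
2  a  2020-01-10  closing
3  b  2020-01-06    start
4  b  2020-02-01    start
5  b  2020-02-20    start
6  b  2020-02-20  closing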

CodePudding user response:

I think you can find a nice middle ground between the fully procedural (but fairly readable) approach you started with and the fully vectorized pandas (but, imho, less readable) approaches in the other answers: pop your logic out as a function that operates on a sub-dataframe holding just the "id" in question, then simply apply it per group:

def make_start_precede_close(subdf):
    # Earliest "closing" date within this id's rows
    closing_date = subdf.loc[subdf['type'] == 'closing', 'date'].min()
    # Cap any later date at that closing date
    subdf.loc[subdf['date'] > closing_date, 'date'] = closing_date
    return subdf

df = df.groupby("id", group_keys=False).apply(make_start_precede_close)

(Note the assignment back to df so the result isn't discarded; group_keys=False keeps the original index rather than adding id as an index level.)

Ultimately, you'll have to make a decision about the tradeoffs between testability, reusability, readability, and performance. Pure pandas is likely to win on performance (though not necessarily), but in my experience, popping out an explicit function wins in the other categories.
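
For instance, the extracted function can be unit-tested on a single group without involving groupby at all. A minimal sketch (the test function and expected values here are hypothetical, hand-derived from the sample data):

import pandas as pd

def test_make_start_precede_close():
    subdf = pd.DataFrame({
        "id": ["a", "a", "a"],
        "date": ["2020-01-01", "2020-01-11", "2020-01-10"],
        "type": ["start", "start", "closing"],
    })
    result = make_start_precede_close(subdf)
    # The 2020-01-11 start should be capped at the 2020-01-10 closing
    assert list(result["date"]) == ["2020-01-01", "2020-01-10", "2020-01-10"]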

CodePudding user response:

Avoid doing things row by row (or id by id). First, group by id and find the minimum closing date. Then merge this back onto the original dataframe to create a closing_date column with a value for every row. Now take the row-wise minimum of the original date column and closing_date. Finally, drop the helper closing_date column.

# Earliest closing date per id
closing_date = df[df['type'] == 'closing'].groupby('id').date.min()
# The merge yields date_x (the original) and date_y (the closing date); keep the latter
df['closing_date'] = df.merge(closing_date, on='id').date_y

# Row-wise minimum; ISO-formatted date strings compare correctly as strings
df['date'] = df[['date', 'closing_date']].min(axis=1)
df = df.drop(columns='closing_date')

Output:

  id        date     type
0  a  2020-01-01    start
1  a  2020-01-10    start
2  a  2020-01-10  closing
3  b  2020-01-06    start
4  b  2020-02-01    start
5  b  2020-02-20    start
6  b  2020-02-20  closing
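
Note that the merge relies on the merged frame keeping the same row order and default index as df. A sketch of an equivalent formulation that sidesteps this, assuming the same closing_date Series from above, aligns by id with map instead:

# map looks up each row's id in closing_date's index and keeps df's own index
df['closing_date'] = df['id'].map(closing_date)
df['date'] = df[['date', 'closing_date']].min(axis=1)
df = df.drop(columns='closing_date')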

CodePudding user response:

You can use groupby.transform to get the closing date per group, then replace the values that are greater with boolean indexing:

closing = (df['date'].where(df['type'].eq('closing'))  # NaN everywhere except closing rows
           .groupby(df['id']).transform('last')        # broadcast each id's closing date to all its rows
          )

df.loc[df['date'].gt(closing), 'date'] = closing

Output:

  id        date     type
0  a  2020-01-01    start
1  a  2020-01-10    start
2  a  2020-01-10  closing
3  b  2020-01-06    start
4  b  2020-02-01    start
5  b  2020-02-20    start
6  b  2020-02-20  closing

If you have several "closing" rows per id and want to be sure you get the most recent one, sort first:

closing = (df['date'].sort_values()                    # ascending, so 'last' picks the latest closing
           .where(df['type'].eq('closing'))
           .groupby(df['id']).transform('last')
          )

df.loc[df['date'].gt(closing), 'date'] = closing
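
Conversely, if you want the earliest closing per id, matching the .min() in the question's loop, a minimal variant (same pattern, just a different aggregation) skips the sort and transforms with 'min':

closing = (df['date'].where(df['type'].eq('closing'))
           .groupby(df['id']).transform('min')
          )

df.loc[df['date'].gt(closing), 'date'] = closing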