Modify value in a column based on conditions from other columns in pandas

I have this example df:

start_date   end_date id
22-07-2022 28-07-2022  A
22-07-2022 28-07-2022  B
22-07-2022 17-08-2022  A

I need to obtain this result:

start_date   end_date id
22-07-2022 28-07-2022  A
22-07-2022 28-07-2022  B
29-07-2022 17-08-2022  A

To put it in words: for each id, I need to take the earliest end_date, add 1 day, and use that as the start_date of the other row with the same id, so the dates don't overlap. There will always be at most 2 rows per id, and the start_date will be the same for both of them.

Right now I'm trying to use a for loop adding conditions but I guess there's a more Pythonic way to do this using pandas functions. Any hint?

Thanks

Answer:

The following code defines the same data you provide programmatically and does the type of transformation you seek.

import pandas as pd
from datetime import timedelta

data = {
    "start_date": ["22-07-2022", "22-07-2022", "22-07-2022"],
    "end_date": ["28-07-2022", "28-07-2022", "17-08-2022"],
    "id": ["A", "B", "A"],
}

df = (
    pd.DataFrame(data)
    .assign(start_date=lambda x: pd.to_datetime(x.start_date, format="%d-%m-%Y"))
    .assign(end_date=lambda x: pd.to_datetime(x.end_date, format="%d-%m-%Y"))
    .sort_values("end_date")
)


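# Shift each id's end_date down one row, add a day, and fall back to
# the original start_date where there is no previous row (NaT).
df["new_start_date"] = (
    df.groupby("id").end_date.shift() + timedelta(days=1)
).combine_first(df.start_date)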

In the last statement, we group by id and shift each group's sorted end_date down by one row, which leaves a missing value (NaT) in the first row of each group. We then add one day to the shifted values, so each one points to the day after the previous end date. Finally, combine_first fills the remaining gaps with the original start_date, much like a coalesce.
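
If you would rather overwrite start_date itself and get the rows back in their original order and string format, the same shift-and-fill idea could be sketched roughly as follows. This is only a minimal sketch under the assumptions stated in the question (at most 2 rows per id, same start_date within an id); the names prev_end and result are introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "start_date": ["22-07-2022", "22-07-2022", "22-07-2022"],
        "end_date": ["28-07-2022", "28-07-2022", "17-08-2022"],
        "id": ["A", "B", "A"],
    }
)

# Parse the day-first strings into real datetimes.
for col in ["start_date", "end_date"]:
    df[col] = pd.to_datetime(df[col], format="%d-%m-%Y")

# Sort so that, within each id, the earlier end_date comes first.
df = df.sort_values(["id", "end_date"])

# Previous end_date within each id; NaT for the first row of each group.
prev_end = df.groupby("id")["end_date"].shift()

# Day after the previous end_date where one exists, else the original start_date.
df["start_date"] = (prev_end + pd.Timedelta(days=1)).fillna(df["start_date"])

# Restore the original row order and the dd-mm-YYYY string format.
result = df.sort_index()
for col in ["start_date", "end_date"]:
    result[col] = result[col].dt.strftime("%d-%m-%Y")

print(result)
```

Sorting by both id and end_date makes the order within each group explicit, and sort_index() at the end relies on the default integer index to recover the original row order.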