I have this example df:
start_date end_date id
22-07-2022 28-07-2022 A
22-07-2022 28-07-2022 B
22-07-2022 17-08-2022 A
I need to obtain this result:
start_date end_date id
22-07-2022 28-07-2022 A
22-07-2022 28-07-2022 B
29-07-2022 17-08-2022 A
In other words, for each id I need to take the earliest end_date, add 1 day, and use that as the start_date of the other row with the same id, so the date ranges don't overlap. There will always be at most 2 rows per id, and both rows will have the same start_date.
Right now I'm trying to use a for loop adding conditions but I guess there's a more Pythonic way to do this using pandas functions. Any hint?
Thanks
CodePudding user response:
The following code defines the same data you provide programmatically and does the type of transformation you seek.
import pandas as pd
from datetime import timedelta
data = {
    "start_date": ["22-07-2022", "22-07-2022", "22-07-2022"],
    "end_date": ["28-07-2022", "28-07-2022", "17-08-2022"],
    "id": ["A", "B", "A"],
}
df = (
    pd.DataFrame(data)
    .assign(start_date=lambda x: pd.to_datetime(x.start_date, format="%d-%m-%Y"))
    .assign(end_date=lambda x: pd.to_datetime(x.end_date, format="%d-%m-%Y"))
    .sort_values("end_date")
)
df["new_start_date"] = (
    # previous end_date within each id, plus one day (NaT where there is no previous row)
    df.groupby("id").end_date.shift() + timedelta(days=1)
).combine_first(df.start_date)
In the last statement, we group by the id column and shift the sorted end_date values within each id forward by one row, which leaves a missing value (NaT) in the first row of each group. We then add one day to the shifted column so that it points to the day after the previous end date. Finally, we use combine_first to fill the remaining gaps with the original start_date, and store the result in new_start_date.
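If you would rather overwrite start_date in place and keep the original row order, the same idea can be sketched as follows. This is only a minimal variant of the approach above, assuming at most two rows per id as stated in the question; the names ordered and prev_end are just illustrative, and the final strftime call is only there to print the dates in the original dd-mm-yyyy format.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "start_date": ["22-07-2022", "22-07-2022", "22-07-2022"],
        "end_date": ["28-07-2022", "28-07-2022", "17-08-2022"],
        "id": ["A", "B", "A"],
    }
)

# Parse both date columns as day-month-year
for col in ("start_date", "end_date"):
    df[col] = pd.to_datetime(df[col], format="%d-%m-%Y")

# Sort a copy so the earlier end_date comes first within each id;
# df itself keeps its original row order.
ordered = df.sort_values("end_date")

# Previous end_date within each id (NaT for the first row of each group)
prev_end = ordered.groupby("id")["end_date"].shift()

# Day after the previous end_date, falling back to the original start_date.
# The assignment aligns on the index, so the row order of df is unchanged.
df["start_date"] = (prev_end + pd.Timedelta(days=1)).combine_first(df["start_date"])

# Dates rendered back as dd-mm-yyyy strings for display
result = df.assign(
    start_date=df["start_date"].dt.strftime("%d-%m-%Y"),
    end_date=df["end_date"].dt.strftime("%d-%m-%Y"),
)
print(result)
```

Because pandas aligns on the index when assigning, sorting a copy is enough and the original DataFrame does not need to be re-sorted afterwards.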