I have a DataFrame like the one below.
Id | d_of_arr | d_of_sty |
---|---|---|
1 | 2021-12-03 | 2021-12-04 |
1 | 2021-12-03 | 2021-12-05 |
1 | 2021-12-03 | 2021-12-06 |
2 | 2021-12-09 | 2021-12-10 |
2 | 2021-12-09 | 2021-12-11 |
I want to build a dates column that contains the arrival date and all the stay dates for each Id, like below:
Id | dates |
---|---|
1 | 2021-12-03 |
1 | 2021-12-04 |
1 | 2021-12-05 |
1 | 2021-12-06 |
2 | 2021-12-09 |
2 | 2021-12-10 |
2 | 2021-12-11 |
How can I do this using Python/pandas?
CodePudding user response:
If performance matters or the DataFrame is large, use Index.repeat
to duplicate each row by the difference in days between the two dates (plus one), then add timedeltas built from a GroupBy.cumcount counter
with to_timedelta,
and finally sort and remove duplicates:
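import pandas as pd

df['d_of_arr'] = pd.to_datetime(df['d_of_arr'])
df['d_of_sty'] = pd.to_datetime(df['d_of_sty'])
# Repeat each row once per day from arrival to stay date, inclusive
df = df.loc[df.index.repeat(df['d_of_sty'].sub(df['d_of_arr']).dt.days.add(1))]
# Add a 0, 1, 2, ... day offset within each original row (assumes a unique index)
df['dates'] = df['d_of_arr'].add(pd.to_timedelta(df.groupby(level=0).cumcount(), unit='D'))
# Keep one row per (Id, date) pair
df1 = df[['Id','dates']].sort_values(['Id','dates']).drop_duplicates(ignore_index=True)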
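To see what each step of the repeat approach does, here is a minimal self-contained sketch. It rebuilds the sample data from the question by hand (how the data is actually loaded is an assumption), and the names n_days, expanded, offset, and result are only illustrative. It also assumes the original index is unique, since the counter is computed per index label.

```python
import pandas as pd

# Sample data from the question; constructing it by hand is an assumption.
df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "d_of_arr": ["2021-12-03"] * 3 + ["2021-12-09"] * 2,
    "d_of_sty": ["2021-12-04", "2021-12-05", "2021-12-06",
                 "2021-12-10", "2021-12-11"],
})
df["d_of_arr"] = pd.to_datetime(df["d_of_arr"])
df["d_of_sty"] = pd.to_datetime(df["d_of_sty"])

# Days covered by each row, counting the arrival day itself.
n_days = df["d_of_sty"].sub(df["d_of_arr"]).dt.days.add(1)

# Repeat each row n_days times; repeated rows keep their original index label.
expanded = df.loc[df.index.repeat(n_days)]

# Counter 0, 1, 2, ... within each original row, used as a day offset.
offset = expanded.groupby(level=0).cumcount()
expanded = expanded.assign(
    dates=expanded["d_of_arr"] + pd.to_timedelta(offset, unit="D")
)

# Overlapping stays of the same Id produce repeated dates, so drop them.
result = (expanded[["Id", "dates"]]
          .sort_values(["Id", "dates"])
          .drop_duplicates(ignore_index=True))
print(result)
```

The intermediate frame has one row per day of every original row, which is why the final drop_duplicates is needed when several rows of the same Id overlap.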
Or, if the DataFrame is small or performance is not important, use a list comprehension with date_range and DataFrame.explode
to create the new rows:
df['dates'] = [pd.date_range(s, e) for s, e in zip(df['d_of_arr'], df['d_of_sty'])]
df1 = (df.explode('dates')[['Id','dates']]
         .sort_values(['Id','dates'])
         .drop_duplicates(ignore_index=True))
print(df1)
   Id      dates
0   1 2021-12-03
1   1 2021-12-04
2   1 2021-12-05
3   1 2021-12-06
4   2 2021-12-09
5   2 2021-12-10
6   2 2021-12-11
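For the explode approach, a small self-contained sketch may help show why the drop_duplicates step matters. The sample data is rebuilt by hand from the question (an assumption about how it is loaded), and the names ranges, exploded, and deduped are only illustrative. Because explode can leave the new column with object dtype, the sketch converts it back with pd.to_datetime.

```python
import pandas as pd

# Sample data from the question; constructing it by hand is an assumption.
df = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "d_of_arr": ["2021-12-03"] * 3 + ["2021-12-09"] * 2,
    "d_of_sty": ["2021-12-04", "2021-12-05", "2021-12-06",
                 "2021-12-10", "2021-12-11"],
})
df["d_of_arr"] = pd.to_datetime(df["d_of_arr"])
df["d_of_sty"] = pd.to_datetime(df["d_of_sty"])

# One list of daily timestamps per row, from arrival to stay date inclusive.
ranges = [list(pd.date_range(s, e))
          for s, e in zip(df["d_of_arr"], df["d_of_sty"])]
df["dates"] = ranges

# explode turns every list element into its own row.
exploded = df.explode("dates")[["Id", "dates"]]
exploded = exploded.assign(dates=pd.to_datetime(exploded["dates"]))
print(len(exploded))  # rows before de-duplication

# Overlapping ranges of the same Id repeat dates, so keep each pair once.
deduped = (exploded.sort_values(["Id", "dates"])
           .drop_duplicates(ignore_index=True))
print(len(deduped))   # rows after de-duplication
```

With this sample the exploded frame has twice as many rows as the final result, because every row of an Id starts again from the same arrival date.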