This question is actually very similar to my previous one (given here: Splitting Dates in a Dataframe into 2 separate Dataframes), but somewhat more complex, so after trying it for more than an hour I though I should ask.
I have a dataframe where two of the columns Start
and End
are lists of dates. What I would like to do is create a new dataframe where the Start
date matches with the closest End
date, and for each matching pair to create a new line in the outputted dataframe
Basically, if there are two values in the Start
column, then as long as the date in the End
column is after the first date and before the second date (as given in row BBB in the examples below), then I want to keep these values the outputted dataframe. Additionally, even if there's no date in the End
column (as in row EEE in the examples below) then I still want to split it. And if either or both the Start
and End
columns are empty, then they are kept in the dataframe.
As an example, for the dataframe below:
Name Start End
AAA 2017-09-13
BBB 2021-11-20, 2022-06-04 2022-04-07
CCC 2022-09-29
DDD
EEE 2021-04-28, 2022-06-14
FFF 2021-06-25, 2022-06-19 2022-03-18, 2024-07-22
GGG 2020-10-23,2021-06-10, 2022-03-02 2021-03-06, 2022-01-04, 2024-08-15
and the final dataframe to look like this:
Name Start End
AAA 2016-09-13
BBB 2022-06-04 2022-04-07
CCC 2022-09-29
DDD
EEE 2022-06-14
FFF 2021-06-25 2022-03-18
FFF 2022-06-19 2024-07-22
GGG 2020-10-23 2021-03-06
GGG 2021-06-10 2022-01-04
GGG 2022-03-02 2024-08-15
I tried to modify the code as given in the link above, but I was unable to get the output I wanted (I'm quite new to Python unfortunately...). So any help would be greatly appreciated, thank you!
CodePudding user response:
It is actually very different from your other question, and not so straightforward.
I would use a merge_asof
, with a little trick of temporarily replacing the NaN values of End with a dummy date (here I used '1970-01-01'):
# pre-process the dataframe to split the strings into lists
df2 = df.set_index('Name').apply(lambda s: s.str.split(',\s*'))
out = (pd.merge_asof(
# explode End and fill NaN with 0 to get 1970-01-01
pd.to_datetime(df2['End'].explode().fillna(0))
.sort_values().reset_index(),
pd.to_datetime(df2['Start'].explode().dropna())
.sort_values().reset_index(),
by='Name', left_on='End', right_on='Start',
direction='nearest'
)
.sort_values(by='Name')
[['Name', 'Start', 'End']]
.assign(End=lambda d: d['End'].mask(d['End'].eq('1970-01-01')))
)
output:
Name Start End
0 AAA 2017-09-13 NaT
7 BBB 2022-06-04 2022-04-07
1 CCC 2022-09-29 NaT
2 DDD NaT NaT
3 EEE 2021-04-28 NaT
6 FFF 2022-06-19 2022-03-18
8 FFF 2022-06-19 2024-07-22
4 GGG 2021-06-10 2021-03-06
5 GGG 2022-03-02 2022-01-04
9 GGG 2022-03-02 2024-08-15