Splitting Dates in 2 columns in a Dataframe-CodePudding

This question is actually very similar to my previous one (given here: Splitting Dates in a Dataframe into 2 separate Dataframes), but somewhat more complex, so after trying it for more than an hour I though I should ask.

I have a dataframe where two of the columns Start and End are lists of dates. What I would like to do is create a new dataframe where the Start date matches with the closest End date, and for each matching pair to create a new line in the outputted dataframe

Basically, if there are two values in the Start column, then as long as the date in the End column is after the first date and before the second date (as given in row BBB in the examples below), then I want to keep these values the outputted dataframe. Additionally, even if there's no date in the End column (as in row EEE in the examples below) then I still want to split it. And if either or both the Start and End columns are empty, then they are kept in the dataframe.

As an example, for the dataframe below:

Name               Start                               End
AAA               2017-09-13    
BBB         2021-11-20, 2022-06-04                 2022-04-07
CCC                                                2022-09-29
DDD 
EEE        2021-04-28, 2022-06-14
FFF        2021-06-25, 2022-06-19                2022-03-18, 2024-07-22
GGG      2020-10-23,2021-06-10, 2022-03-02       2021-03-06, 2022-01-04, 2024-08-15

and the final dataframe to look like this:

Name        Start                    End
AAA         2016-09-13          
BBB         2022-06-04             2022-04-07
CCC                                2022-09-29
DDD 
EEE         2022-06-14
FFF         2021-06-25             2022-03-18
FFF         2022-06-19             2024-07-22
GGG         2020-10-23             2021-03-06
GGG         2021-06-10             2022-01-04
GGG         2022-03-02             2024-08-15

I tried to modify the code as given in the link above, but I was unable to get the output I wanted (I'm quite new to Python unfortunately...). So any help would be greatly appreciated, thank you!

CodePudding user response：

It is actually very different from your other question, and not so straightforward.

I would use a merge_asof, with a little trick of temporarily replacing the NaN values of End with a dummy date (here I used '1970-01-01'):

# pre-process the dataframe to split the strings into lists
df2 = df.set_index('Name').apply(lambda s: s.str.split(',\s*'))

out = (pd.merge_asof(
              # explode End and fill NaN with 0 to get 1970-01-01
              pd.to_datetime(df2['End'].explode().fillna(0))
                .sort_values().reset_index(),
              pd.to_datetime(df2['Start'].explode().dropna())
                .sort_values().reset_index(),
              by='Name', left_on='End', right_on='Start',
              direction='nearest'
             )
         .sort_values(by='Name')
         [['Name', 'Start', 'End']]
         .assign(End=lambda d: d['End'].mask(d['End'].eq('1970-01-01')))
      )

output:

  Name      Start        End
0  AAA 2017-09-13        NaT
7  BBB 2022-06-04 2022-04-07
1  CCC 2022-09-29        NaT
2  DDD        NaT        NaT
3  EEE 2021-04-28        NaT
6  FFF 2022-06-19 2022-03-18
8  FFF 2022-06-19 2024-07-22
4  GGG 2021-06-10 2021-03-06
5  GGG 2022-03-02 2022-01-04
9  GGG 2022-03-02 2024-08-15