Splitting Dates in a Dataframe into 2 separate Dataframes-CodePudding

I have a dataframe where two of the columns Start and End are lists of dates. What I would like to do is create two separate dataframes where, for the first dataframe, the first value in the Start column matches the value in the End column, while for the second dataframe the second value in the Start column also matches the value in the End column.

Basically, if there are two values in the Start column, then as long as the date in the End column is after the first date and before the second date (as given in row BBB in the examples below), then I want to put these values into two separate dataframes. Additionally, even if there's no date in the End column (as in row EEE in the examples below) then I still want to split it. Finally, if either or both the Start and End columns are empty, then they are kept in both dataframes.

As an example, for the dataframe below:

Name           Start                         End
AAA           2017-09-13            
BBB     2021-11-20, 2022-06-04         2022-04-07
CCC                                    2022-09-29
DDD 
EEE     2021-04-28, 2022-06-14

I am trying to get the first dataframe to look like this:

Name        Start               End
AAA        2017-09-13           
BBB        2021-11-20          2022-04-07
CCC                            2022-09-29
DDD 
EEE        2021-04-28

and the second dataframe to look like this:

Name        Start               End
AAA         2016-09-13          
BBB         2022-06-04         2022-04-07
CCC                            2022-09-29
DDD 
EEE         2022-06-14

If the dates in the Start and End columns weren't in lists, it would be slightly easier, but as of now I'm finding it difficult to think of a computationally fast way of doing this, so any help would be greatly appreciated, thanks!

CodePudding user response：

You can use:

tmp_df = df.assign(Start=df['Start'].str.split(',')).explode('Start')

df1 = tmp_df.groupby(level=0).first()
df2 = tmp_df.groupby(level=0).last()

NB. if you already have lists, you can skip the .assign(Start=df['Start'].str.split(',')).

output:

# df1
  Name       Start         End
0  AAA  2017-09-13        None
1  BBB  2021-11-20  2022-04-07
2  CCC        None  2022-09-29
3  DDD        None        None
4  EEE  2021-04-28        None

# df2
  Name        Start         End
0  AAA   2017-09-13        None
1  BBB   2022-06-04  2022-04-07
2  CCC         None  2022-09-29
3  DDD         None        None
4  EEE   2022-06-14        None

CodePudding user response：

What i would do is to create two different columns based on the dates you have and right after that you can create the dfs you need. To do so, I would define two different functions to be applied on a vectorized with map for each case like this:

def first(date) -> str:
return str(date).split(", ")[0]

def second(date:str) -> str:
    return str(date).split(", ")[1]

df["first_date"] = df["start"].apply(first)
df["second_date"] = df["start"].apply(second)

Take into account that you should manage now the empty string but as a first approach to your problem it should give you some light.