I have a dataframe where two of the columns Start
and End
are lists of dates. What I would like to do is create two separate dataframes where, for the first dataframe, the first value in the Start
column matches the value in the End
column, while for the second dataframe the second value in the Start
column also matches the value in the End
column.
Basically, if there are two values in the Start
column, then as long as the date in the End
column is after the first date and before the second date (as given in row BBB in the examples below), then I want to put these values into two separate dataframes. Additionally, even if there's no date in the End
column (as in row EEE in the examples below) then I still want to split it. Finally, if either or both the Start
and End
columns are empty, then they are kept in both dataframes.
As an example, for the dataframe below:
Name Start End
AAA 2017-09-13
BBB 2021-11-20, 2022-06-04 2022-04-07
CCC 2022-09-29
DDD
EEE 2021-04-28, 2022-06-14
I am trying to get the first dataframe to look like this:
Name Start End
AAA 2017-09-13
BBB 2021-11-20 2022-04-07
CCC 2022-09-29
DDD
EEE 2021-04-28
and the second dataframe to look like this:
Name Start End
AAA 2016-09-13
BBB 2022-06-04 2022-04-07
CCC 2022-09-29
DDD
EEE 2022-06-14
If the dates in the Start
and End
columns weren't in lists, it would be slightly easier, but as of now I'm finding it difficult to think of a computationally fast way of doing this, so any help would be greatly appreciated, thanks!
CodePudding user response:
You can use:
tmp_df = df.assign(Start=df['Start'].str.split(',')).explode('Start')
df1 = tmp_df.groupby(level=0).first()
df2 = tmp_df.groupby(level=0).last()
NB. if you already have lists, you can skip the .assign(Start=df['Start'].str.split(','))
.
output:
# df1
Name Start End
0 AAA 2017-09-13 None
1 BBB 2021-11-20 2022-04-07
2 CCC None 2022-09-29
3 DDD None None
4 EEE 2021-04-28 None
# df2
Name Start End
0 AAA 2017-09-13 None
1 BBB 2022-06-04 2022-04-07
2 CCC None 2022-09-29
3 DDD None None
4 EEE 2022-06-14 None
CodePudding user response:
What i would do is to create two different columns based on the dates you have and right after that you can create the dfs you need. To do so, I would define two different functions to be applied on a vectorized with map for each case like this:
def first(date) -> str:
return str(date).split(", ")[0]
def second(date:str) -> str:
return str(date).split(", ")[1]
df["first_date"] = df["start"].apply(first)
df["second_date"] = df["start"].apply(second)
Take into account that you should manage now the empty string but as a first approach to your problem it should give you some light.