Home > OS >  Splitting DataFrame rows into multiple based on a condition
Splitting DataFrame rows into multiple based on a condition

Time:09-29

I have a Dataframe df1 that has a bunch of columns like so:

val_1 val_2 start end val_3 val_4
0 10 70 1/1/2020 3/4/2020 10 20
1 20 80 1/1/2020 3/6/2021 30 40
2 30 90 1/1/2020 6/4/2021 50 60
3 40 100 12/5/2020 7/4/2021 70 80
4 89 300 4/5/2020 6/8/2022 40 10

I need to iterate over the rows, and split the cross-year periods into continuous same year ones. The remaining values in the row need to stay the same and maintain their data types like so:

val_1 val_2 start end val_3 val_4
0 10 70 1/1/2020 3/4/2020 10 20
1 20 80 1/1/2020 12/31/2020 30 40
2 20 80 1/1/2021 3/6/2021 30 40
3 30 90 1/1/2020 12/31/2020 50 60
4 30 90 1/1/2021 6/4/2021 50 60
5 40 100 7/5/2021 11/17/2021 70 80
6 89 300 4/5/2020 12/31/2020 40 10
7 89 300 1/1/2021 12/31/2021 40 10
8 89 300 1/1/2021 6/8/2022 40 10

Is there a fast and efficient way to do this? I tried iterating over the rows and doing it but I'm having trouble with the indices and appending rows after an index. Also, people have said that's probably not the best idea to edit things that I'm iterating over so I was wondering if there is a better way to do it. Any suggestions will be appreciated. Thank you!

EDIT

If the row spans more than a year, that should break into 3 or more rows, accordingly. I've edited the tables to accurately reflect this. Thank you!

CodePudding user response:

Find the year end after date_range, then explode

df['end'] = [[y] pd.date_range(x,y)[pd.date_range(x,y).is_year_end].strftime('%m/%d/%y').tolist() for x , y in zip(df['start'],df['end'])]
df = df.explode('end')
df
Out[29]: 
   val_1  val_2      start       end  val_3  val_4
0     10     70   1/1/2020  3/4/2020     10     20
1     20     80   1/1/2020  3/6/2021     30     40
1     20     80   1/1/2020  12/31/20     30     40
2     30     90   1/1/2020  6/4/2021     50     60
2     30     90   1/1/2020  12/31/20     50     60
3     40    100  12/5/2020  7/4/2021     70     80
3     40    100  12/5/2020  12/31/20     70     80

Update

df.end=pd.to_datetime(df.end)
df.start=pd.to_datetime(df.start)
df['Newstart'] = [list(set([x] pd.date_range(x,y)[pd.date_range(x,y).is_year_start].tolist()))
                  for x , y in zip(df['start'],df['end'])]
df['Newend'] = [[y] pd.date_range(x,y)[pd.date_range(x,y).is_year_end].tolist()
                  for x , y in zip(df['start'],df['end'])]

out = df.explode(['Newend','Newstart'])
val_1  val_2      start        end  val_3  val_4   Newstart     Newend
0     10     70 2020-01-01 2020-03-04     10     20 2020-01-01 2020-03-04
1     20     80 2020-01-01 2021-03-06     30     40 2021-01-01 2021-03-06
1     20     80 2020-01-01 2021-03-06     30     40 2020-01-01 2020-12-31
2     30     90 2020-01-01 2021-06-04     50     60 2021-01-01 2021-06-04
2     30     90 2020-01-01 2021-06-04     50     60 2020-01-01 2020-12-31
3     40    100 2020-12-05 2021-07-04     70     80 2021-01-01 2021-07-04
3     40    100 2020-12-05 2021-07-04     70     80 2020-12-05 2020-12-31
4     89    300 2020-04-05 2022-06-08     40     10 2020-04-05 2022-06-08
4     89    300 2020-04-05 2022-06-08     40     10 2022-01-01 2020-12-31
4     89    300 2020-04-05 2022-06-08     40     10 2021-01-01 2021-12-31

CodePudding user response:

Here's a different approach (note that I've already converted start and end to datetimes):

mask = df.start.dt.year != df.end.dt.year
prev_year = df[mask].copy()
post_year = df[mask].copy()

prev_year.end = prev_year.apply(lambda row: pd.Timestamp(row.start.year, 12, 31), axis=1)
post_year.start = post_year.apply(lambda row: pd.Timestamp(row.end.year, 1, 1), axis=1)

new_df = pd.concat([df[~mask], prev_year, post_year]).sort_index().reset_index(drop=True)

Steps:

In [3]: df
Out[3]:
   val_1  val_2      start        end  val_3  val_4
0     10     70 2020-01-01 2020-03-04     10     20
1     20     80 2020-01-01 2021-03-06     30     40
2     30     90 2020-01-01 2021-06-04     50     60
3     40    100 2020-12-05 2021-07-04     70     80

In [4]: mask = df.start.dt.year != df.end.dt.year

In [5]: prev_year = df[mask].copy()

In [6]: post_year = df[mask].copy()

In [7]: prev_year.end = prev_year.apply(lambda row: pd.Timestamp(row.start.year, 12, 31), axis=1)

In [8]: post_year.start = post_year.apply(lambda row: pd.Timestamp(row.end.year, 1, 1), axis=1)

In [9]: new_df = pd.concat([df[~mask], prev_year, post_year]).sort_index().reset_index(drop=True)

In [10]: new_df
Out[10]:
   val_1  val_2      start        end  val_3  val_4
0     10     70 2020-01-01 2020-03-04     10     20
1     20     80 2020-01-01 2020-12-31     30     40
2     20     80 2021-01-01 2021-03-06     30     40
3     30     90 2020-01-01 2020-12-31     50     60
4     30     90 2021-01-01 2021-06-04     50     60
5     40    100 2020-12-05 2020-12-31     70     80
6     40    100 2021-01-01 2021-07-04     70     80
  • Related