Splitting DataFrame rows into multiple based on a condition-CodePudding

I have a Dataframe df1 that has a bunch of columns like so:

	val_1	val_2	start	end	val_3	val_4
0	10	70	1/1/2020	3/4/2020	10	20
1	20	80	1/1/2020	3/6/2021	30	40
2	30	90	1/1/2020	6/4/2021	50	60
3	40	100	12/5/2020	7/4/2021	70	80
4	89	300	4/5/2020	6/8/2022	40	10

I need to iterate over the rows, and split the cross-year periods into continuous same year ones. The remaining values in the row need to stay the same and maintain their data types like so:

	val_1	val_2	start	end	val_3	val_4
0	10	70	1/1/2020	3/4/2020	10	20
1	20	80	1/1/2020	12/31/2020	30	40
2	20	80	1/1/2021	3/6/2021	30	40
3	30	90	1/1/2020	12/31/2020	50	60
4	30	90	1/1/2021	6/4/2021	50	60
5	40	100	7/5/2021	11/17/2021	70	80
6	89	300	4/5/2020	12/31/2020	40	10
7	89	300	1/1/2021	12/31/2021	40	10
8	89	300	1/1/2021	6/8/2022	40	10

Is there a fast and efficient way to do this? I tried iterating over the rows and doing it but I'm having trouble with the indices and appending rows after an index. Also, people have said that's probably not the best idea to edit things that I'm iterating over so I was wondering if there is a better way to do it. Any suggestions will be appreciated. Thank you!

EDIT

If the row spans more than a year, that should break into 3 or more rows, accordingly. I've edited the tables to accurately reflect this. Thank you!

CodePudding user response：

Find the year end after date_range, then explode

df['end'] = [[y] pd.date_range(x,y)[pd.date_range(x,y).is_year_end].strftime('%m/%d/%y').tolist() for x , y in zip(df['start'],df['end'])]
df = df.explode('end')
df
Out[29]: 
   val_1  val_2      start       end  val_3  val_4
0     10     70   1/1/2020  3/4/2020     10     20
1     20     80   1/1/2020  3/6/2021     30     40
1     20     80   1/1/2020  12/31/20     30     40
2     30     90   1/1/2020  6/4/2021     50     60
2     30     90   1/1/2020  12/31/20     50     60
3     40    100  12/5/2020  7/4/2021     70     80
3     40    100  12/5/2020  12/31/20     70     80

Update

df.end=pd.to_datetime(df.end)
df.start=pd.to_datetime(df.start)
df['Newstart'] = [list(set([x] pd.date_range(x,y)[pd.date_range(x,y).is_year_start].tolist()))
                  for x , y in zip(df['start'],df['end'])]
df['Newend'] = [[y] pd.date_range(x,y)[pd.date_range(x,y).is_year_end].tolist()
                  for x , y in zip(df['start'],df['end'])]

out = df.explode(['Newend','Newstart'])
val_1  val_2      start        end  val_3  val_4   Newstart     Newend
0     10     70 2020-01-01 2020-03-04     10     20 2020-01-01 2020-03-04
1     20     80 2020-01-01 2021-03-06     30     40 2021-01-01 2021-03-06
1     20     80 2020-01-01 2021-03-06     30     40 2020-01-01 2020-12-31
2     30     90 2020-01-01 2021-06-04     50     60 2021-01-01 2021-06-04
2     30     90 2020-01-01 2021-06-04     50     60 2020-01-01 2020-12-31
3     40    100 2020-12-05 2021-07-04     70     80 2021-01-01 2021-07-04
3     40    100 2020-12-05 2021-07-04     70     80 2020-12-05 2020-12-31
4     89    300 2020-04-05 2022-06-08     40     10 2020-04-05 2022-06-08
4     89    300 2020-04-05 2022-06-08     40     10 2022-01-01 2020-12-31
4     89    300 2020-04-05 2022-06-08     40     10 2021-01-01 2021-12-31

CodePudding user response：

Here's a different approach (note that I've already converted start and end to datetimes):

mask = df.start.dt.year != df.end.dt.year
prev_year = df[mask].copy()
post_year = df[mask].copy()

prev_year.end = prev_year.apply(lambda row: pd.Timestamp(row.start.year, 12, 31), axis=1)
post_year.start = post_year.apply(lambda row: pd.Timestamp(row.end.year, 1, 1), axis=1)

new_df = pd.concat([df[~mask], prev_year, post_year]).sort_index().reset_index(drop=True)

Steps:

In [3]: df
Out[3]:
   val_1  val_2      start        end  val_3  val_4
0     10     70 2020-01-01 2020-03-04     10     20
1     20     80 2020-01-01 2021-03-06     30     40
2     30     90 2020-01-01 2021-06-04     50     60
3     40    100 2020-12-05 2021-07-04     70     80

In [4]: mask = df.start.dt.year != df.end.dt.year

In [5]: prev_year = df[mask].copy()

In [6]: post_year = df[mask].copy()

In [7]: prev_year.end = prev_year.apply(lambda row: pd.Timestamp(row.start.year, 12, 31), axis=1)

In [8]: post_year.start = post_year.apply(lambda row: pd.Timestamp(row.end.year, 1, 1), axis=1)

In [9]: new_df = pd.concat([df[~mask], prev_year, post_year]).sort_index().reset_index(drop=True)

In [10]: new_df
Out[10]:
   val_1  val_2      start        end  val_3  val_4
0     10     70 2020-01-01 2020-03-04     10     20
1     20     80 2020-01-01 2020-12-31     30     40
2     20     80 2021-01-01 2021-03-06     30     40
3     30     90 2020-01-01 2020-12-31     50     60
4     30     90 2021-01-01 2021-06-04     50     60
5     40    100 2020-12-05 2020-12-31     70     80
6     40    100 2021-01-01 2021-07-04     70     80