I have a Dataframe df1
that has a bunch of columns like so:
val_1 | val_2 | start | end | val_3 | val_4 | |
---|---|---|---|---|---|---|
0 | 10 | 70 | 1/1/2020 | 3/4/2020 | 10 | 20 |
1 | 20 | 80 | 1/1/2020 | 3/6/2021 | 30 | 40 |
2 | 30 | 90 | 1/1/2020 | 6/4/2021 | 50 | 60 |
3 | 40 | 100 | 12/5/2020 | 7/4/2021 | 70 | 80 |
4 | 89 | 300 | 4/5/2020 | 6/8/2022 | 40 | 10 |
I need to iterate over the rows, and split the cross-year periods into continuous same year ones. The remaining values in the row need to stay the same and maintain their data types like so:
val_1 | val_2 | start | end | val_3 | val_4 | |
---|---|---|---|---|---|---|
0 | 10 | 70 | 1/1/2020 | 3/4/2020 | 10 | 20 |
1 | 20 | 80 | 1/1/2020 | 12/31/2020 | 30 | 40 |
2 | 20 | 80 | 1/1/2021 | 3/6/2021 | 30 | 40 |
3 | 30 | 90 | 1/1/2020 | 12/31/2020 | 50 | 60 |
4 | 30 | 90 | 1/1/2021 | 6/4/2021 | 50 | 60 |
5 | 40 | 100 | 7/5/2021 | 11/17/2021 | 70 | 80 |
6 | 89 | 300 | 4/5/2020 | 12/31/2020 | 40 | 10 |
7 | 89 | 300 | 1/1/2021 | 12/31/2021 | 40 | 10 |
8 | 89 | 300 | 1/1/2021 | 6/8/2022 | 40 | 10 |
Is there a fast and efficient way to do this? I tried iterating over the rows and doing it but I'm having trouble with the indices and appending rows after an index. Also, people have said that's probably not the best idea to edit things that I'm iterating over so I was wondering if there is a better way to do it. Any suggestions will be appreciated. Thank you!
EDIT
If the row spans more than a year, that should break into 3 or more rows, accordingly. I've edited the tables to accurately reflect this. Thank you!
CodePudding user response:
Find the year end after date_range
, then explode
df['end'] = [[y] pd.date_range(x,y)[pd.date_range(x,y).is_year_end].strftime('%m/%d/%y').tolist() for x , y in zip(df['start'],df['end'])]
df = df.explode('end')
df
Out[29]:
val_1 val_2 start end val_3 val_4
0 10 70 1/1/2020 3/4/2020 10 20
1 20 80 1/1/2020 3/6/2021 30 40
1 20 80 1/1/2020 12/31/20 30 40
2 30 90 1/1/2020 6/4/2021 50 60
2 30 90 1/1/2020 12/31/20 50 60
3 40 100 12/5/2020 7/4/2021 70 80
3 40 100 12/5/2020 12/31/20 70 80
Update
df.end=pd.to_datetime(df.end)
df.start=pd.to_datetime(df.start)
df['Newstart'] = [list(set([x] pd.date_range(x,y)[pd.date_range(x,y).is_year_start].tolist()))
for x , y in zip(df['start'],df['end'])]
df['Newend'] = [[y] pd.date_range(x,y)[pd.date_range(x,y).is_year_end].tolist()
for x , y in zip(df['start'],df['end'])]
out = df.explode(['Newend','Newstart'])
val_1 val_2 start end val_3 val_4 Newstart Newend
0 10 70 2020-01-01 2020-03-04 10 20 2020-01-01 2020-03-04
1 20 80 2020-01-01 2021-03-06 30 40 2021-01-01 2021-03-06
1 20 80 2020-01-01 2021-03-06 30 40 2020-01-01 2020-12-31
2 30 90 2020-01-01 2021-06-04 50 60 2021-01-01 2021-06-04
2 30 90 2020-01-01 2021-06-04 50 60 2020-01-01 2020-12-31
3 40 100 2020-12-05 2021-07-04 70 80 2021-01-01 2021-07-04
3 40 100 2020-12-05 2021-07-04 70 80 2020-12-05 2020-12-31
4 89 300 2020-04-05 2022-06-08 40 10 2020-04-05 2022-06-08
4 89 300 2020-04-05 2022-06-08 40 10 2022-01-01 2020-12-31
4 89 300 2020-04-05 2022-06-08 40 10 2021-01-01 2021-12-31
CodePudding user response:
Here's a different approach (note that I've already converted start
and end
to datetimes):
mask = df.start.dt.year != df.end.dt.year
prev_year = df[mask].copy()
post_year = df[mask].copy()
prev_year.end = prev_year.apply(lambda row: pd.Timestamp(row.start.year, 12, 31), axis=1)
post_year.start = post_year.apply(lambda row: pd.Timestamp(row.end.year, 1, 1), axis=1)
new_df = pd.concat([df[~mask], prev_year, post_year]).sort_index().reset_index(drop=True)
Steps:
In [3]: df
Out[3]:
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2021-03-06 30 40
2 30 90 2020-01-01 2021-06-04 50 60
3 40 100 2020-12-05 2021-07-04 70 80
In [4]: mask = df.start.dt.year != df.end.dt.year
In [5]: prev_year = df[mask].copy()
In [6]: post_year = df[mask].copy()
In [7]: prev_year.end = prev_year.apply(lambda row: pd.Timestamp(row.start.year, 12, 31), axis=1)
In [8]: post_year.start = post_year.apply(lambda row: pd.Timestamp(row.end.year, 1, 1), axis=1)
In [9]: new_df = pd.concat([df[~mask], prev_year, post_year]).sort_index().reset_index(drop=True)
In [10]: new_df
Out[10]:
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2020-12-31 30 40
2 20 80 2021-01-01 2021-03-06 30 40
3 30 90 2020-01-01 2020-12-31 50 60
4 30 90 2021-01-01 2021-06-04 50 60
5 40 100 2020-12-05 2020-12-31 70 80
6 40 100 2021-01-01 2021-07-04 70 80