I have a dataframe that looks like this:
import pandas as pd
data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-15', 1], ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4], ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-12', 9]]
df = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])
  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-15      1
2     A  2022-09-15  2022-09-18     12
3     B  2022-09-01  2022-09-03      4
4     B  2022-09-03  2022-09-07      7
5     B  2022-09-07  2022-09-12      9
I have a certain_date, for example 2022-09-10. For each row in each group, I would like to split the row if certain_date falls strictly between its start_date and end_date. In that case, the end_date of that row should change to certain_date, and an extra row should be added below it whose start_date is certain_date and whose end_date is the original end_date. The value should be the same for both rows. Here is the expected output:
certain_date = '2022-09-10'
data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-10', 1], ['A', '2022-09-10', '2022-09-15', 1], ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4], ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-10', 9], ['B', '2022-09-10', '2022-09-12', 9]]
df_desired = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])
  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-10      1
2     A  2022-09-10  2022-09-15      1
3     A  2022-09-15  2022-09-18     12
4     B  2022-09-01  2022-09-03      4
5     B  2022-09-03  2022-09-07      7
6     B  2022-09-07  2022-09-10      9
7     B  2022-09-10  2022-09-12      9
For GROUP A, certain_date falls within the date range of the second row, so that row is split as described above. The same happens to the last row of GROUP B. Is there a way to do this using pandas?
CodePudding user response:
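You can identify the matching rows, then concat the dataframe without those rows and two copies of the matching slice, one with a changed end_date and one with a changed start_date. Listing the end_date copy first and sorting the index with a stable sort keeps each split pair in chronological order:
certain_date = '2022-09-10'
# is date after start? (string comparison works because the dates are ISO formatted)
m1 = df['start_date'].lt(certain_date)
# is date before stop?
m2 = df['end_date'].gt(certain_date)
# is both? (you could do all in one line)
m = m1 & m2
out = pd.concat([df[~m],
                 df[m].assign(end_date=certain_date),
                 df[m].assign(start_date=certain_date)]).sort_index(kind='stable')
output:
  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-10      1
1     A  2022-09-10  2022-09-15      1
2     A  2022-09-15  2022-09-18     12
3     B  2022-09-01  2022-09-03      4
4     B  2022-09-03  2022-09-07      7
5     B  2022-09-07  2022-09-10      9
5     B  2022-09-10  2022-09-12      9
The split rows keep their original index label, so add .reset_index(drop=True) if you want a fresh 0 to 7 index.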
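As a self-contained illustration of the mask-and-concat idea, here is a minimal sketch wrapped in a helper. The name split_at_date is hypothetical, not part of pandas. It assumes the date columns hold ISO-formatted strings, compares them as real timestamps rather than as text, and writes the cut date back as a string.

```python
import pandas as pd

def split_at_date(df, certain_date):
    """Split each row whose (start_date, end_date) range strictly contains certain_date.

    Hypothetical helper; assumes start_date and end_date are ISO date strings.
    """
    cut = pd.Timestamp(certain_date)
    # Compare as timestamps so the test does not rely on string ordering.
    start = pd.to_datetime(df['start_date'])
    end = pd.to_datetime(df['end_date'])
    m = start.lt(cut) & end.gt(cut)
    cut_str = cut.strftime('%Y-%m-%d')
    out = pd.concat([df[~m],
                     df[m].assign(end_date=cut_str),     # first half of the split
                     df[m].assign(start_date=cut_str)])  # second half of the split
    # A stable sort keeps the first half ahead of the second half for equal
    # index labels; ignore_index=True then renumbers the rows from 0.
    return out.sort_index(kind='stable', ignore_index=True)

data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-15', 1],
        ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4],
        ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-12', 9]]
df = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])

out = split_at_date(df, '2022-09-10')
print(out)
```

Rows whose start_date or end_date equals certain_date are left alone here, because both comparisons are strict.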
CodePudding user response:
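You can try splitting start_date, certain_date and end_date into lists, then explode the result (exploding several columns at once needs pandas 1.3 or later):
# rows that contain certain_date get two-element lists, other rows keep their scalars
df[['start_date', 'end_date']] = df.apply(
    lambda row: [[row['start_date'], certain_date],
                 [certain_date, row['end_date']]]
    if row['start_date'] < certain_date < row['end_date']
    else [row['start_date'], row['end_date']],
    axis=1, result_type='expand')
# explode both columns together so each pair of lists becomes two rows
out = df.explode(['start_date', 'end_date'], ignore_index=True)
print(out)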
  GROUP  start_date    end_date  value
0     A  2022-09-01  2022-09-05     10
1     A  2022-09-05  2022-09-10      1
2     A  2022-09-10  2022-09-15      1
3     A  2022-09-15  2022-09-18     12
4     B  2022-09-01  2022-09-03      4
5     B  2022-09-03  2022-09-07      7
6     B  2022-09-07  2022-09-10      9
7     B  2022-09-10  2022-09-12      9
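If multi-column explode is not available, or you would rather not overwrite df in place, the same idea can be sketched with a single helper column. This is only an illustrative variant: the names pieces and piece are made up for the example, and the dates are assumed to be ISO strings so that plain string comparison orders them correctly.

```python
import pandas as pd

data = [['A', '2022-09-01', '2022-09-05', 10], ['A', '2022-09-05', '2022-09-15', 1],
        ['A', '2022-09-15', '2022-09-18', 12], ['B', '2022-09-01', '2022-09-03', 4],
        ['B', '2022-09-03', '2022-09-07', 7], ['B', '2022-09-07', '2022-09-12', 9]]
df = pd.DataFrame(data, columns=['GROUP', 'start_date', 'end_date', 'value'])
certain_date = '2022-09-10'

# One list of (start, end) tuples per row: two tuples when the row must be
# split, a single tuple otherwise.
pieces = [
    [(s, certain_date), (certain_date, e)] if s < certain_date < e else [(s, e)]
    for s, e in zip(df['start_date'], df['end_date'])
]

# Explode the single helper column; each tuple becomes its own row and the
# other columns (GROUP, value) are repeated.
out = df.assign(piece=pd.Series(pieces, index=df.index)).explode('piece', ignore_index=True)

# Unpack the tuples back into the two date columns and drop the helper.
pairs = out.pop('piece')
out['start_date'] = [p[0] for p in pairs]
out['end_date'] = [p[1] for p in pairs]
print(out)
```

The original df is left unchanged, which makes it easier to rerun the split with a different certain_date.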