Suppose that you have a Pandas data frame that can be created using code below:
test_df = pd.DataFrame(
{'start_date': ['2021-07-01', '2021-07-02', '2021-07-03',
'2021-07-04', '2021-07-05', '2021-07-06'],
'end_date': ['2021-07-03', '2021-07-04', '2021-07-05',
'2021-07-06', '2021-07-07', '2021-07-08'],
'returns': [1, 1, 0.99, 0.98, 0.99, 0.97]})
test_df = test_df.sort_values('returns', ascending=False)
Assuming that returns
are always sorted what would be the efficient way to remove overlapping intervals? I do not want to use loops as the data set is large is there a vectorized approach to achieve the output below?
Expected Output
------------ ------------ ---------
| start_date | end_date | returns |
------------ ------------ ---------
| 2021-07-01 | 2021-07-03 | 1 |
| 2021-07-05 | 2021-07-07 | 0.99 |
------------ ------------ ---------
CodePudding user response:
There was an answer from @mozway here that works without a loop by using np.triu()
(numpy's upper triangle).
Slightly changing it for your DataFrame:
import numpy as np
test_df["start_date"] = pd.to_datetime(test_df["start_date"])
test_df["end_date"] = pd.to_datetime(test_df["end_date"])
a = np.triu(test_df['end_date'].values > test_df['start_date'].values[:, None])
b = np.triu(test_df['start_date'].values < test_df['end_date'].values[:, None])
test_df[(a & b).sum(0) == 1]
# start_date end_date returns
#0 2021-07-01 2021-07-03 1.00
#4 2021-07-05 2021-07-07 0.99
Explanation:
test_df['end_date'].values > test_df['start_date'].values[:, None]
#array([[ True, True, True, True, True, True],
# [ True, True, True, True, True, True],
# [False, True, True, True, True, True],
# [False, False, False, True, True, True],
# [False, False, True, True, True, True],
# [False, False, False, True, False, True]])
This returns an array where the end_date values are greater than the start_date values. This looks at each combination of start date and end date (with each end date as a column, and each start date as a row. If the condition is met, then True
is returned.
Taking the upper triangle of this means that only combinations where the start date comes with or before the end date (in terms of test_df
rows) are True
, with the rest False
b
on the other hand looks for all instances where the end date is greater than the start dates (with each start date as a column, and each end date as a row). For the first row of test_df['start_date'].values < test_df['end_date'].values[:, None]
, it is looking at combinations of the first end date 2021-07-03
and whether it is greater than the start dates.
The upper triangle of this means that only occurrences where the start date is before the end date in the dataframe are True
.
Deconstructing the last line test_df[(a & b).sum(0) == 1]
:
(a & b)
is an array whereTrue
occurs where both arrays areTrue
.(a & b).sum(0)
sums each column of this array whereTrue == 1
andFalse == 0
.(a & b).sum(0) == 1
is rows where there is only one occurrence ofa
andb
beingTrue
together. We only want these because we want rows where:
- the start date is less than a previous (or current) end date
AND
- the end date is greater than a previous (or current) start date
HOWEVER
If this occurs more than once, then there must be an overlap. Because the leading diagonal will always be True
(as the end date for a given dataframe row must ALWAYS be greater than the start date) for this scenario, if this occurs more than once this means:
- The start date is less than a previous end date AND the current end date
AND
- The end date is greater than a previous starting date AND the current start date
Which means that there MUST be an overlap here.
I understand this is quite confusing, but it does make sense!