Home > front end >  Pandas Data Frame - Remove Overlapping Intervals
Pandas Data Frame - Remove Overlapping Intervals

Time:05-15

Suppose that you have a Pandas data frame that can be created using code below:

test_df = pd.DataFrame(
    {'start_date': ['2021-07-01', '2021-07-02', '2021-07-03',
                    '2021-07-04', '2021-07-05', '2021-07-06'],
     'end_date': ['2021-07-03', '2021-07-04', '2021-07-05',
                  '2021-07-06', '2021-07-07', '2021-07-08'],
     'returns': [1, 1, 0.99, 0.98, 0.99, 0.97]})
test_df = test_df.sort_values('returns', ascending=False)

Assuming that returns are always sorted what would be the efficient way to remove overlapping intervals? I do not want to use loops as the data set is large is there a vectorized approach to achieve the output below?

Expected Output

 ------------ ------------ --------- 
| start_date |  end_date  | returns |
 ------------ ------------ --------- 
| 2021-07-01 | 2021-07-03 |       1 |
| 2021-07-05 | 2021-07-07 |    0.99 |
 ------------ ------------ --------- 

CodePudding user response:

There was an answer from @mozway here that works without a loop by using np.triu() (numpy's upper triangle).

Slightly changing it for your DataFrame:

import numpy as np
test_df["start_date"] = pd.to_datetime(test_df["start_date"])
test_df["end_date"] = pd.to_datetime(test_df["end_date"])

a = np.triu(test_df['end_date'].values > test_df['start_date'].values[:, None])
b = np.triu(test_df['start_date'].values < test_df['end_date'].values[:, None])
test_df[(a & b).sum(0) == 1]
#  start_date   end_date  returns
#0 2021-07-01 2021-07-03     1.00
#4 2021-07-05 2021-07-07     0.99

Explanation:

test_df['end_date'].values > test_df['start_date'].values[:, None]
#array([[ True,  True,  True,  True,  True,  True],
#       [ True,  True,  True,  True,  True,  True],
#       [False,  True,  True,  True,  True,  True],
#       [False, False, False,  True,  True,  True],
#       [False, False,  True,  True,  True,  True],
#       [False, False, False,  True, False,  True]])

This returns an array where the end_date values are greater than the start_date values. This looks at each combination of start date and end date (with each end date as a column, and each start date as a row. If the condition is met, then True is returned.

Taking the upper triangle of this means that only combinations where the start date comes with or before the end date (in terms of test_df rows) are True, with the rest False

b on the other hand looks for all instances where the end date is greater than the start dates (with each start date as a column, and each end date as a row). For the first row of test_df['start_date'].values < test_df['end_date'].values[:, None], it is looking at combinations of the first end date 2021-07-03 and whether it is greater than the start dates.

The upper triangle of this means that only occurrences where the start date is before the end date in the dataframe are True.

Deconstructing the last line test_df[(a & b).sum(0) == 1]:

  • (a & b) is an array where True occurs where both arrays are True.
  • (a & b).sum(0) sums each column of this array where True == 1 and False == 0.
  • (a & b).sum(0) == 1 is rows where there is only one occurrence of a and b being True together. We only want these because we want rows where:
  1. the start date is less than a previous (or current) end date

AND

  1. the end date is greater than a previous (or current) start date

HOWEVER

If this occurs more than once, then there must be an overlap. Because the leading diagonal will always be True (as the end date for a given dataframe row must ALWAYS be greater than the start date) for this scenario, if this occurs more than once this means:

  1. The start date is less than a previous end date AND the current end date

AND

  1. The end date is greater than a previous starting date AND the current start date

Which means that there MUST be an overlap here.

I understand this is quite confusing, but it does make sense!

  • Related