Find smallest set of consecutive pandas rows meeting an aggregate condition



I have a dataframe that contains datapoints for my most recent run (via Strava API). The dataframe has two columns:

| time_elapsed | distance_covered |
|--------------|------------------|
| 1            | 1.1              |
| 2            | 2.3              |
| 3            | 3.2              |
| ...          | ...              |
| 5876         | 15200.3          |

Both columns are cumulative. I would now like to find the fastest 5km that I ran as part of my workout, i.e. the subset of consecutive rows such that

  • row[-1]['distance_covered']-row[0]['distance_covered'] >= 5000
  • row[-1]['time_elapsed'] - row[0]['time_elapsed'] is minimised

My solution below works, but it is rather slow and doesn't feel particularly elegant, since I'm essentially iterating over the same rows multiple times.

from collections import defaultdict

import numpy as np

distance_time = defaultdict(list)
for i in range(df.shape[0]):
    working_df = df.iloc[i:].copy()
    previous = df.iloc[i]['distance_covered']
    working_df['distance_covered'] = working_df['distance_covered'] - previous
    ix = np.argmax(working_df['distance_covered'] > 5000)
    if ix == 0:
        # no remaining window covers 5000 m
        break
    else:
        distance_time[5000].append(working_df.iloc[ix]['time_elapsed'] - working_df.iloc[0]['time_elapsed'])
fastest_time = np.nanmin(distance_time[5000])

An alternative would be to calculate interval times and then use cumsum, but I would still need to consider every possible start row, so I don't think it would be any faster.

Does anybody have a better idea / faster way of achieving the same result?

Edit: actual example output from df.head(10).to_dict()

{'time_elapsed': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9}, 'distance_covered': {0: 1.1, 1: 2.3, 2: 4.4, 3: 7.1, 4: 9.7, 5: 12.3, 6: 15.6, 7: 18.7, 8: 21.7}}

CodePudding user response:

setup

df = pd.DataFrame(
    {
        'time_elapsed': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9},
        'distance_covered': {0: 1.1, 1: 2.3, 2: 4.4, 3: 7.1, 4: 9.7, 5: 12.3, 6: 15.6, 7: 18.7, 8: 21.7}
    },
)

solution

In the following I'm assuming time is in seconds and distance is in metres. This isn't stated in the question; if it's not correct, adjust the numbers below accordingly.

Insert initial point:

df.loc[-1] = [0, 0]

Set distance_covered as the index, then reindex in 0.1m (10cm) intervals and interpolate the time column. This is akin to "resampling" when the index is time-based

interpolated = (
    df
    .set_index("distance_covered")
    .sort_index()
    .reindex(np.arange(0, int(df["distance_covered"].max()) + 1, 0.1))
    .interpolate()
)

`interpolated` is a dataframe with a single column. It looks like this:

                  time_elapsed
distance_covered              
0.0                   0.000000
0.1                   0.090909
0.2                   0.181818
0.3                   0.272727
0.4                   0.363636
...                        ...
18.5                  7.937500
18.6                  7.968750
18.7                  8.000000
18.8                  8.000000
18.9                  8.000000

The data is still cumulative, so let's take the difference between successive values and then do a rolling sum over 50000 rows = 50000 x 0.1 m = 5 km:

interpolated["time_elapsed"].diff().rolling(50000).sum()

This gives you a series, indexed by distance covered, whose values are the time taken to cover the preceding 5 km. From there, find the smallest value in the series (that is your fastest 5 km), or plot it, etc.
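Putting the whole pipeline together, here is a minimal runnable sketch on synthetic constant-pace data (10 m/s, so every 10 m split takes exactly 1 s). The 10 m target and 100-row window stand in for the 5 km / 50000-row window purely so the example stays small; the `np.round` on the grid is an extra guard, not in the code above, against float grid points failing to match the data index exactly:

```python
import numpy as np
import pandas as pd

# Synthetic constant-pace run: 10 m/s for 10 s, sampled once per second.
df = pd.DataFrame({
    'time_elapsed': np.arange(0.0, 11.0),             # 0..10 s, cumulative
    'distance_covered': np.arange(0.0, 101.0, 10.0),  # 0..100 m, cumulative
})

# Resample onto a uniform 0.1 m distance grid and interpolate the time column.
# np.round keeps grid points like 10.0 exactly equal to the data's index values.
grid = np.round(np.arange(0.0, 100.05, 0.1), 1)
interpolated = (
    df.set_index('distance_covered')
      .sort_index()
      .reindex(grid)
      .interpolate()
)

# Time to cover each successive 0.1 m, then a rolling sum over 100 rows
# = 100 x 0.1 m = 10 m (the stand-in for the 50000-row / 5 km window).
window = 100
split_times = interpolated['time_elapsed'].diff().rolling(window).sum()
fastest = split_times.min()   # fastest 10 m split; 1.0 s at constant pace
```

With real Strava data you would keep `window = 50000` on the 0.1 m grid; the minimum of the rolling sum is then the fastest 5 km time.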

CodePudding user response:

If Python's while loops were not dead slow (compared to Cython/C), you could solve this with a sliding window. The time complexity of this solution is O(n), since every element in the series is visited at most twice: once by the ending pointer i and once by the starting pointer j.

i = 0
j = 0
dist = df['distance_covered']
time = df['time_elapsed']
res = []
while i < len(df):
    while i < len(df) and dist.iloc[i] - dist.iloc[j] < 5000:
        i = i + 1
    if i == len(df):
        break
    while j < i and dist.iloc[i] - dist.iloc[j] >= 5000:
        if not res or time.iloc[res[1]] - time.iloc[res[0]] > time.iloc[i] - time.iloc[j]:
            res = [j, i]
        j = j + 1

fastest_time = time.iloc[res[1]] - time.iloc[res[0]]  # res holds [start, end]
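If you want the same sliding-window idea without the Python-level loop, `np.searchsorted` can vectorize it: since `distance_covered` is non-decreasing, a single call finds, for every start row, the first row at least `target` metres further along. This is a sketch of an alternative approach, not the code above; `fastest_split` is an illustrative name:

```python
import numpy as np
import pandas as pd

def fastest_split(df, target=5000.0):
    """Fastest time to cover `target` metres, assuming both columns
    are cumulative and distance_covered is non-decreasing."""
    dist = df['distance_covered'].to_numpy()
    time = df['time_elapsed'].to_numpy()
    # For each start index j, the first index i with dist[i] >= dist[j] + target.
    ends = np.searchsorted(dist, dist + target, side='left')
    valid = ends < len(dist)        # starts whose window fits inside the run
    if not valid.any():
        return None                 # the whole run is shorter than target
    starts = np.nonzero(valid)[0]
    return (time[ends[valid]] - time[starts]).min()
```

On the question's sample data with `target=10.0` (that run only covers about 22 m), this returns 4, matching the brute-force minimum over all start rows.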