Home > Net >  What is the fastest way to filter a pandas time series?
What is the fastest way to filter a pandas time series?

Time:01-10

What is the fastest way to filter a pandas time series? For now I use boolean masking to filter the time series ts:

import time
from datetime import datetime
import pandas as pd
import statistics

# create time series
idx = pd.date_range(start='2022-01-01', end='2023-01-01', freq="min")
ts = pd.Series(1, index=idx)

start_dt = datetime(2022, 1, 1, 0, 0, 0)
end_dt = datetime(2022, 1, 2, 0, 0, 0)

time_lst = []

# measure performance of boolean masking
for i in range(100):
    start = time.time()

    # 1st method
    mask = (ts.index > start_dt) & (ts.index <= end_dt)

    # 2nd method, nearly same velociy
    # mask = np.where((ts.index > start_dt) & (ts.index <= end_dt), True, False)

    time_lst.append(time.time() - start)

print(statistics.mean(time_lst))

filtered_ts = ts.loc[mask]

I am wondering, if this is already the fastest way (here ~0.003 s per run) or are there other methods? I use the masking many thousands of times for different start_dt and end_dt and it sums up to a significant time which I want to reduce.

CodePudding user response:

Your solution is really fast:

%timeit ts[(ts.index > start_dt) & (ts.index <= end_dt)]
5.02 ms ± 413 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%timeit ts[ts.index.to_series().between(start_dt, end_dt, inclusive='left')]
8.22 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But if possible change solution with include both datetimes using Series.loc is faster:

%timeit ts[(ts.index >= start_dt) & (ts.index <= end_dt)]

%timeit ts.loc[start_dt:end_dt]
138 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  • Related