What is the fastest way to filter a pandas time series? For now I use boolean masking to filter the time series ts:
import time
from datetime import datetime
import pandas as pd
import statistics
# create time series
idx = pd.date_range(start='2022-01-01', end='2023-01-01', freq="min")
ts = pd.Series(1, index=idx)
start_dt = datetime(2022, 1, 1, 0, 0, 0)
end_dt = datetime(2022, 1, 2, 0, 0, 0)
time_lst = []
# measure performance of boolean masking
for i in range(100):
start = time.time()
# 1st method
mask = (ts.index > start_dt) & (ts.index <= end_dt)
# 2nd method, nearly same velociy
# mask = np.where((ts.index > start_dt) & (ts.index <= end_dt), True, False)
time_lst.append(time.time() - start)
print(statistics.mean(time_lst))
filtered_ts = ts.loc[mask]
I am wondering, if this is already the fastest way (here ~0.003 s per run) or are there other methods? I use the masking many thousands of times for different start_dt and end_dt and it sums up to a significant time which I want to reduce.
CodePudding user response:
Your solution is really fast:
%timeit ts[(ts.index > start_dt) & (ts.index <= end_dt)]
5.02 ms ± 413 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit ts[ts.index.to_series().between(start_dt, end_dt, inclusive='left')]
8.22 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But if possible change solution with include both datetimes using Series.loc
is faster:
%timeit ts[(ts.index >= start_dt) & (ts.index <= end_dt)]
%timeit ts.loc[start_dt:end_dt]
138 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)