I'm trying to bin (downsample) a time series based on its timestamps. For instance:
import numpy as np
import pandas as pd
timestamps = np.linspace(0, 1000, 10000)
values = np.random.random(10000)
I usually convert it to a dataframe, and use cut (or qcut) to create the bins:
timeseries_df = pd.DataFrame({"Timestamps": timestamps, "Values": values})
timeseries_df["Bins"] = pd.cut(timeseries_df["Timestamps"], 100)  # downsample by two orders of magnitude
ds_timestamps = timeseries_df.groupby("Bins")["Timestamps"].max()
ds_values = timeseries_df.groupby("Bins")["Values"].mean()
This works, but I'm writing functions that I can reuse, and I'd like to avoid pandas if possible. I've tried implementing a version of what's been suggested here:
ds_timestamps = np.linspace(timestamps.min(), timestamps.max(), 100)
digitized_timestamps = np.digitize(timestamps, ds_timestamps)
ds_values = [values[digitized_timestamps == i + 1].mean() for i in range(len(ds_timestamps))]
This also works but is extremely slow. Is there another way of doing this?
CodePudding user response:
As mentioned in the comments, if your main reason for avoiding pandas is speed, I'd actually recommend using it. Pandas isn't written entirely in Python: many of its internals are written in Cython (which compiles to C), so they are very fast.
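If you still want a NumPy-only version, the slow part of your attempt is the list comprehension, which rescans the whole array once per bin. A minimal sketch that avoids this uses np.digitize once and then np.bincount to get per-bin sums and counts. Note that bin_mean is a hypothetical helper name, and returning the right bin edges as the downsampled timestamps is my assumption, not something from your code. Bins are right-closed, as with pd.cut's default, but I haven't checked that the edges match pandas exactly.

```python
import numpy as np

def bin_mean(timestamps, values, n_bins):
    """Average `values` over `n_bins` equal-width bins of `timestamps`.

    Hypothetical helper: returns (right bin edges, per-bin means).
    """
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    # n_bins + 1 edges define n_bins equal-width bins.
    edges = np.linspace(timestamps.min(), timestamps.max(), n_bins + 1)
    # Using only the interior edges yields indices 0 .. n_bins - 1;
    # right=True makes each bin closed on its right edge.
    idx = np.digitize(timestamps, edges[1:-1], right=True)
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=values, minlength=n_bins)
    # Empty bins are reported as NaN instead of dividing by zero.
    means = np.full(n_bins, np.nan)
    np.divide(sums, counts, out=means, where=counts > 0)
    return edges[1:], means

# Usage with the data from the question.
timestamps = np.linspace(0, 1000, 10000)
values = np.random.random(10000)
ds_timestamps, ds_values = bin_mean(timestamps, values, 100)
```

This makes a fixed number of passes over the data regardless of the number of bins, so it should scale much better than the per-bin boolean mask. It only computes means; other aggregations (max, median) would need a different approach.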