Time series compression with interpolation-CodePudding

I basically have an algorithm, but it is really slow. Since my algorithm/problem is so simple, I expect, that this may exist somewhere (in fast) and there might also be a name for this. And before I start developing a faster version of my algorithm, I first try to ask here (I don't want to reinvent things).

The problem is simple: I have a time series from an experiment, which is quite large (~5 GB). The thing is, that most of the data points are placed on a line, e.g.

(t=0.0, y=0.0), ... , (t=1.0, y=0.5), ... , (t=2.0, y=1.0)

This could obviously be simplified by interpolating the first and the last point with a straight line. In principle, I can test, if the points between an interval can be approximated by a straight line, within some tolerance (I don't need lossless compression) and throw away the points in between.

My current algorithm works as follows:

I have points within an interval [a,b] and I create a linear interpolation between the first and the last point (let's call this interpolation f).
Then, I compute the error Abs(f(t) - y) at each time series point and select the point, with the largest error (let's cal this point tmax).
I split the interval [a,b] -> [a, tmax], [tmax, b]
Repeat my algorithm on the sub intervals, until a tolerance is reached, or the interval contains only one or 2 points. Return the interval boundaries.

This algorithm works surprisingly well in approximating a signal, but it is really slow and as already said, I believe that there exist already something, which does the same thing or solves my problem.

Thanks for the help, if anything is unclear, don't hesitate to ask.

CodePudding user response：

It looks like you want the Swinging Door compression algorithm. It basically works by using the mental image of a pair of doors to quickly absorb points into a range that can be approximated by a single straight line. It shows up a lot for processing time series in industrial automation. Which is a domain where people wind up collecting a lot of data, very quickly, and needing to summarize it on the fly before doing other calculations.

I won't explain it because there are plenty of good explanations out there, with source code. Here a links to a couple.

Swinging Door in PostgreSQL

Swinging Door in Python