I basically have an algorithm, but it is really slow. Since my algorithm/problem is so simple, I expect that it already exists somewhere (in a fast implementation) and that there might be a name for it. Before I start developing a faster version of my algorithm, I am asking here first (I don't want to reinvent things).
The problem is simple: I have a time series from an experiment, which is quite large (~5 GB). The thing is that most of the data points lie on a line, e.g.
(t=0.0, y=0.0), ... , (t=1.0, y=0.5), ... , (t=2.0, y=1.0)
This could obviously be simplified by interpolating between the first and the last point with a straight line. In principle, I can test whether the points within an interval can be approximated by a straight line within some tolerance (I don't need lossless compression) and throw away the points in between.
My current algorithm works as follows:
- I have points within an interval [a,b] and I create a linear interpolation between the first and the last point (let's call this interpolation f).
- Then, I compute the error Abs(f(t) - y) at each time series point and select the point with the largest error (let's call this point tmax).
- I split the interval [a,b] -> [a, tmax], [tmax, b]
- Repeat the algorithm on the sub-intervals until the tolerance is reached or the interval contains only one or two points, and return the interval boundaries (a sketch of this procedure follows below).
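For reference, a minimal sketch of this procedure in Python/NumPy (assuming `t` and `y` are NumPy arrays with strictly increasing `t`; the function name, signature, and the recursive structure are just illustrative):

```python
import numpy as np

def simplify(t, y, tol):
    """Keep only the points needed to approximate (t, y) by piecewise-linear
    segments within `tol`, using the recursive split described above."""
    keep = [0, len(t) - 1]              # always keep the interval boundaries

    def recurse(a, b):
        if b - a < 2:                   # one or two points: nothing in between
            return
        # linear interpolation f between the first and the last point of [a, b]
        f = y[a] + (y[b] - y[a]) * (t[a:b + 1] - t[a]) / (t[b] - t[a])
        err = np.abs(f - y[a:b + 1])
        i = int(np.argmax(err))         # offset (relative to a) of largest error
        if err[i] <= tol:
            return                      # whole interval fits within tolerance
        tmax = a + i
        keep.append(tmax)
        recurse(a, tmax)                # split [a, b] -> [a, tmax], [tmax, b]
        recurse(tmax, b)

    recurse(0, len(t) - 1)
    keep.sort()
    return np.array(keep)               # indices of the points to keep
```

Calling `simplify(t, y, tol)` on the full series returns the indices of the retained points; `t[keep], y[keep]` is then the compressed signal.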
This algorithm works surprisingly well at approximating a signal, but it is really slow, and as already said, I believe something already exists that does the same thing or solves my problem.
Thanks for the help, if anything is unclear, don't hesitate to ask.
CodePudding user response:
It looks like you want the Swinging Door compression algorithm. It works by using the mental image of a pair of doors to quickly absorb points into a range that can be approximated by a single straight line. It shows up a lot for processing time series in industrial automation, a domain where people wind up collecting a lot of data very quickly and needing to summarize it on the fly before doing other calculations.
I won't explain it in detail because there are plenty of good explanations out there, with source code. Here are links to a couple.
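For orientation only, here is a minimal sketch of the idea in Python (not a reference implementation: the function name, the `dev` tolerance parameter, and the choice to store the previous point when the doors close are just one common variant; real implementations differ in such details):

```python
def swinging_door(t, y, dev):
    """Swinging-door style compression: keep a point only when no single
    straight line from the last kept point stays within `dev` of all points
    seen since. Assumes strictly increasing t and at least two points."""
    keep_t, keep_y = [t[0]], [y[0]]
    anchor_t, anchor_y = t[0], y[0]     # last kept point (door hinge)
    slope_hi = float("inf")             # upper door: min of allowed max slopes
    slope_lo = float("-inf")            # lower door: max of allowed min slopes
    prev_t, prev_y = t[0], y[0]

    for ti, yi in zip(t[1:], y[1:]):
        dt = ti - anchor_t
        # tighten the doors so the line stays within dev of the new point
        slope_hi = min(slope_hi, (yi + dev - anchor_y) / dt)
        slope_lo = max(slope_lo, (yi - dev - anchor_y) / dt)
        if slope_lo > slope_hi:
            # doors have closed: keep the previous point and re-anchor there
            keep_t.append(prev_t)
            keep_y.append(prev_y)
            anchor_t, anchor_y = prev_t, prev_y
            dt = ti - anchor_t
            slope_hi = (yi + dev - anchor_y) / dt
            slope_lo = (yi - dev - anchor_y) / dt
        prev_t, prev_y = ti, yi

    keep_t.append(prev_t)               # always keep the final point
    keep_y.append(prev_y)
    return keep_t, keep_y
```

Because each incoming point only updates two running slope bounds, the whole pass is linear in the number of points, which is what makes this style of compression attractive for very large streams.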