I'm designing a latency-sensitive application in Python in which I will have several arrays of timestamps. I am trying to count the number of occurrences of an event in the last 1, 5, 25, 50, and 100 seconds, so 5 arrays in total. I plan to append the time of the event to these arrays whenever it occurs. Then, in a separate thread, I will be removing values that are older than 1, 5, 25, 50, or 100 seconds.
I expect to usually have fewer than 1,000 occurrences per 100 seconds, but the theoretical maximum is 10,000 per 100 seconds. I was planning to use a basic array of datetime objects, but I'm interested in learning what data structures are faster for this. At first I considered a pandas DataFrame, but that proved far too slow. I know about numpy arrays and using time.time() instead, but I imagine there are other ways that may be even more efficient. I would love to hear from a Python expert what the most computationally efficient way of doing this is.
CodePudding user response:
10-100 occurrences per second is nothing really. Simply using a list should be pretty fast. I would probably go with 1 list only. Using 5 collections in this case seems like overkill and not worth it.
Write performance is one thing, read performance another. Unless you need something like a million reads per second, you should be fine just looping over the list and counting the values in it.
One more tip: you would probably want to store the timestamps as numbers (Unix timestamps, i.e. seconds since 1970-01-01). And when you do the reading, do not get the current timestamp for every comparison. Get the current timestamp once, save it into a local variable, and then compare against that local variable. Like this:
import time

curr_time = time.time()          # fetch the current time once
for t in times:
    sec_ago = curr_time - t      # seconds since this event occurred
    ...
Even though CPython is relatively slow when it comes to simple loops, comparisons, etc., performance should still be pretty good for your case, I guess. But if you really need extreme performance, you might consider implementing something more "native" in C, Rust, Cython, etc.
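For illustration, here is a minimal sketch of the single-list idea; the helper names and the WINDOWS tuple are my own, not anything from your code:

import time

WINDOWS = (1, 5, 25, 50, 100)  # window lengths in seconds
events = []                    # one shared list of Unix timestamps, appended in time order

def record_event():
    events.append(time.time())

def count_recent():
    curr_time = time.time()  # fetch the current time once
    # Drop everything older than the largest window.
    cutoff = curr_time - max(WINDOWS)
    while events and events[0] < cutoff:
        events.pop(0)
    # Count how many timestamps fall inside each window.
    return {w: sum(1 for t in events if t >= curr_time - w) for w in WINDOWS}

At 10-100 events per second the counting loop only touches a few thousand floats at most, which plain Python handles comfortably.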
Try it!
CodePudding user response:
A priority queue (heapq) would let you keep timestamps in order as threads finish execution. Heap insertion is O(log(n)). Removing the oldest items is also cheap: the smallest timestamp is always at the front of the heap, so peeking at it is an O(1) constant-time lookup and popping it is O(log(n)).
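As a rough sketch of that idea (record_event and drop_older_than are illustrative names):

import heapq
import time

heap = []  # min-heap of Unix timestamps; heap[0] is always the oldest

def record_event():
    heapq.heappush(heap, time.time())  # O(log n) insertion

def drop_older_than(seconds):
    cutoff = time.time() - seconds
    while heap and heap[0] < cutoff:   # O(1) peek at the oldest timestamp
        heapq.heappop(heap)            # O(log n) removal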
If you use an array and just append timestamps as threads finish execution, you'd have to periodically lock the array from append operations and then re-sort (O(n log(n))) to remove the oldest items. This might be simpler conceptually, but it would take longer to run overall, especially if you do it thousands or millions of times.
Using timeit with any approach is a good way to profile insertion, lookup, and deletion for whatever data structures you use.
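For example, something along these lines (the two candidate operations here are just placeholders):

import timeit

setup = "import heapq, random; data = [random.random() for _ in range(1000)]"

# Compare appending to a plain list against pushing onto a heap.
print(timeit.timeit("lst = []\nfor x in data: lst.append(x)", setup=setup, number=1000))
print(timeit.timeit("h = []\nfor x in data: heapq.heappush(h, x)", setup=setup, number=1000))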
CodePudding user response:
Here's another approach.
If you keep timestamps as integers (seconds from epoch), then you can directly use an array of size 2^31 - 1 - minimum_timestamp as a perfect hash table.
Using an array with a perfect hash function is memory efficient, requiring only O(n) space. It is fast, requiring only O(1) constant time for insertion (writing) and querying (reading).
The value of minimum_timestamp is the minimum timestamp for which you need to track events. This would typically be whatever the current time is, unless you need to track events further back in the past.
The value of 2^31 - 1 is used here because that is the maximum number of seconds from epoch representable in a signed 32-bit integer, so this solution would work at least until 2038, even in the worst-case scenario where minimum_timestamp is zero, i.e. as far back as 1970.
All elements of this array are initialized to 0 and hold the number of events observed at that second. Every index into the array corresponds to a timepoint at or after minimum_timestamp.
When an event occurs at timepoint n (in units of seconds from epoch), the array element at index n - minimum_timestamp is incremented by one.
You could keep a current_timestamp value, where current_timestamp > minimum_timestamp + 100. To read the last 1, 5, 25, 50, or 100 seconds of data, you need only subtract that value from current_timestamp to get the starting index into the array, then sum the per-second counts from there up to the index for current_timestamp.
As your threads work through processing events, you would periodically update current_timestamp as you read out or query the array.
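Here is a minimal sketch of this counting-array idea. One caveat: it pre-allocates a fixed horizon (HORIZON is my own assumption) rather than the full 2^31 - 1 - minimum_timestamp range, since a multi-gigabyte array is rarely practical:

import time

minimum_timestamp = int(time.time())  # earliest second we track
HORIZON = 3600                        # assumed horizon in seconds, not the full 2^31 - 1 range
counts = [0] * HORIZON                # counts[i] = events seen at second minimum_timestamp + i

def record_event():
    counts[int(time.time()) - minimum_timestamp] += 1  # O(1) insertion

def count_last(seconds):
    current_timestamp = int(time.time())
    end = current_timestamp - minimum_timestamp + 1  # one past the current second's bucket
    start = max(end - seconds, 0)                    # first bucket inside the window
    return sum(counts[start:end])                    # one bucket per second in the window

Note that each read is O(w) in the window length w rather than strictly O(1), but with windows of at most 100 seconds that is a sum over at most 100 integers.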
This is not Python-specific, though you could use Cython to manage the underlying data and lookups, if you want the best performance out of a Python-based approach.
Adding to the end of a list ("array") is relatively cheap, except when lists that grow need resizing. Sorting lists for querying is expensive. A perfect hash table will help you avoid sorting, if you can rethink your data in terms of unique indices into a pre-allocated array.