Circular Buffer HashSet in Java

Is it possible to create a HashSet&lt;Long&gt; that works like a fixed-size buffer and still has O(1) complexity (or at least close to it)?

I need a HashSet to store a large number of Longs (without duplicates) one by one, and then query it via contains(Long) as fast as possible.

In my use case it is around 10 million longs in one day. That's why I need a circular buffer with a limit. I can then cap the HashSet size at e.g. 1 million and just start overwriting old items when I go over 1 million items in the set.

What I tried: a HashMap&lt;Integer, Long&gt; with int currentIndex and int maxItems values. I was able to start overwriting old values when currentIndex > maxItems... Sadly, searching this HashMap by value is slow.

CodePudding user response:

A similar question has already been asked. Did you take a look at it? Java Collection for special scrolling, circular queue

CodePudding user response:

I'm not sure if those longs are timestamps or just IDs. Either way though, HashSet cannot do what you want; it's the wrong data store. HashSets have an arbitrary order; asking a hashset 'what is the lowest amongst all longs inside it' is not something it can answer without iterating over the whole thing, nor can you ask for the 'first one I put in'.

You have a few solutions:

  • TreeSet. Technically, everything is O(log n), but ~120 items a second (10 million per 24 hours boils down to about that) is nothing, and as a consequence, O(log n) is equivalent to O(1) for all intents and purposes here. TreeSets 'self sort' - asking a treeset for the lowest long in it is fast. (1 million items? Takes about 20 steps - that's what O(log n) means: every doubling of the size adds just one step, so 100 million items would only take about 27 steps.) If those longs are timestamps, and once the treeset's size hits 1 million you want to wipe out the 'oldest', TreeSet can do it, and can do so very quickly - see the sketch after this list.
  • LinkedHashSet - this is a doubled-up data structure, letting you look up entries by key as well as ask for 'the oldest entry'. While the memory load is larger because of this, the speed is O(1): asking a LinkedHashMap/Set for the 'oldest' entry is as instant as asking a TreeSet for the smallest key. (A sketch of this approach follows after the next paragraph.)
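
A minimal sketch of the TreeSet option, assuming the longs are timestamps so that 'smallest' means 'oldest' (the class and field names here are illustrative, not from the question):

    import java.util.TreeSet;

    class BoundedTreeSet {
        private final TreeSet<Long> set = new TreeSet<>();
        private final int maxItems;

        BoundedTreeSet(int maxItems) {
            this.maxItems = maxItems;
        }

        void add(long value) {
            set.add(value);           // O(log n) insert
            if (set.size() > maxItems) {
                set.pollFirst();      // evict the lowest (oldest) long, O(log n)
            }
        }

        boolean contains(long value) {
            return set.contains(value);  // O(log n) lookup
        }
    }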

If you put things in the map at the time they occur and use the timestamp as a key, either one is fine - because 'oldest in the data structure' and 'entry in the data structure with the lowest key' boil down to the same thing.
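For the LinkedHashSet route, here is a sketch using LinkedHashMap's removeEldestEntry hook together with Collections.newSetFromMap - this gives a Set&lt;Long&gt; with a hard size cap, O(1) add/contains, and automatic eviction of the oldest insertion (the 1 million cap is the figure from the question):

    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Set;

    class BoundedSetDemo {
        // A Set view over a LinkedHashMap; every add() becomes a put(),
        // which triggers removeEldestEntry and evicts the oldest insertion
        // once the cap is exceeded.
        static Set<Long> boundedSet(int maxItems) {
            return Collections.newSetFromMap(new LinkedHashMap<Long, Boolean>() {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                    return size() > maxItems;
                }
            });
        }

        public static void main(String[] args) {
            Set<Long> seen = boundedSet(1_000_000);
            seen.add(42L);                           // O(1) insert
            System.out.println(seen.contains(42L));  // O(1) lookup -> true
        }
    }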

But there are still more data types that may qualify. An ArrayDeque is a circular data structure. It's very similar to an ArrayList, except adding/removing at the start of the data structure is just as instant as adding/removing at the end. Looking up by value is slow just like it is with ArrayLists - if that's your need, look at TreeSet or LinkedHashSet/Map.
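A quick illustration of that ArrayDeque behaviour - O(1) at both ends, but contains() is a linear scan (the tiny cap of 3 is just for the demo):

    import java.util.ArrayDeque;

    class RingDemo {
        public static void main(String[] args) {
            final int maxItems = 3;
            ArrayDeque<Long> buffer = new ArrayDeque<>();
            for (long value = 1; value <= 5; value++) {
                buffer.addLast(value);        // O(1) append at the tail
                if (buffer.size() > maxItems) {
                    buffer.removeFirst();     // O(1) drop at the head
                }
            }
            System.out.println(buffer);               // [3, 4, 5]
            System.out.println(buffer.contains(4L));  // true, but an O(n) scan
        }
    }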

Or, you get straight to the point and use Guava's Cache mechanism. It is specifically designed for the job and has an API to match, including the ability to tell the data store itself to clean out the oldest member so that it never grows beyond 1 million. You can even tell the collection that this cache cleaning should occur 'on write' (i.e. if full, remove the oldest entry) or 'on access' (i.e. if full, remove the entry that has not been looked up for the longest time - that is, any lookup also 'refreshes' it, effectively).
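A sketch of the Guava option, assuming Guava is on the classpath (the dummy Boolean value is only there because Cache stores key/value pairs; maximumSize evicts entries that have been used least recently, and expireAfterWrite/expireAfterAccess give the time-based variants mentioned above):

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    class GuavaCacheDemo {
        public static void main(String[] args) {
            Cache<Long, Boolean> seen = CacheBuilder.newBuilder()
                    .maximumSize(1_000_000)   // never grows beyond 1 million entries
                    .build();

            seen.put(42L, Boolean.TRUE);
            boolean hit = seen.getIfPresent(42L) != null;  // O(1) membership test
            System.out.println(hit);
        }
    }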
