Home > Blockchain >  What is the data structure best fit for fast lookup and memory efficiency on a collection of Strings
What is the data structure best fit for fast lookup and memory efficiency on a collection of Strings

Time:03-21

In my program, I have a huge list of strings (non-duplicate), which are crossed against a word. I want to check if the word is valid — its validity is based on its presence in the list. I'm running this check on many different words, so speed is my biggest concern, along with memory efficiency for the storage of the list.

So my question is, what is the data structure best fit for fast lookup and memory efficiency on a collection of Strings?

I'm currently using a BTreeSet, since it is ordered and I assume it makes the search faster. Also, since the words can be very similar ("apple" and "app"), I have assumed a tree-based data structure is more memory efficient to store them by a Vec.

CodePudding user response:

If you need an exact match, then you should use a HashSet. It should be faster and more efficient than the BTreeSet.

But if you need more sophisticated search/match capabilities, such as searching by common prefix, both case sensitive & insensitive search, a huge number of words with large common prefix, etc, then the hashmap will not be a good fit.

In that case you can use a trie also known as prefix tree. It stores the words character by character, thus "compressing" common prefixes, which also provides a significant speedup if you search for many words that start with a common prefix.

A very basic implementation is just a wrapper around a hash map:

struct Node {
    ptr: HashMap<u8, Node>,

    // you can also use a boolean to mark the end of word, but then you would need additional code to rebuild it from the bytes 
    word: Option<String>,
}

You add a word to the tree by either recursively adding the next word character to the tree or using a bit more performant iterative approach:

   pub fn insert(&mut self, word: String) {
        let w = word.as_bytes();

        let mut node = self;
        for ch in w.iter().copied() {
            node = node.ptr.entry(ch).or_default()
        }

        // attach the word
        node.word = Some(word);
    }

The basic searching is also very simple:

    fn find(&self, word: String) -> Option<String> {
        let word = word.as_bytes()
        let mut node = self;

        for ch in word.iter() {
            match node.ptr.get(ch) {
                Some(link) => node = link,
                None => return None,
            }
        }

        node.word.clone()
    }

But if you want to take advantage of searching "simultaneously" for any word with a prefix, you should take advantage of the recursive structure like in this leetcode problem

There are also some implementation considerations to be made:

  • use a faster hash function, because we are hashing just single bytes, and the default hash function in rust is pretty slow
  • you can "soft-remove" nodes from the trie after they have been found in order to speed up subsequent searches.

You can find such examples in the provided github link

  • Related