Hashmap slower than string.find?-CodePudding

I am doing exercises from leetcode as a way to learn Rust. One exercise involves finding the longest substring without any character repetition inside a string.

My first idea involved storing substrings in a string and searching the string to see if the character was already in it:

impl Solution {
    pub fn length_of_longest_substring(s: String) -> i32 {
        let mut unique_str = String::from("");
        let mut schars: Vec<char> = s.chars().collect();   
        let mut longest = 0 as i32;
        for x in 0..schars.len()
        {
            unique_str = schars[x].to_string(); 
            for y in x 1..schars.len()
            {            
                if is_new_char(&unique_str, schars[y])
                {
                    unique_str.push(schars[y]);
                } else {
                    break;
                }
            }
            let cur_len = unique_str.len() as i32;
            if cur_len > longest {
                longest = cur_len;
            }
        }
        longest
    }
}

fn is_new_char ( unique_str: &str, c: char ) -> bool {
    if unique_str.find(c) == None
    {
        true
    } else {
        false
    }                
}

It works fine but the performance was on the low side. Hoping to shave a few ms on the "find" operation, I replaced unique_str with a HashMap:

use std::collections::HashMap;
impl Solution {
    pub fn length_of_longest_substring(s: String) -> i32 {
        let mut hash_str = HashMap::new();
        let mut schars: Vec<char> = s.chars().collect(); 
        let mut longest = 0 as i32;
        for x in 0..schars.len()
        {            
            hash_str.insert(schars[x], x);
            for y in x 1..schars.len()
            {            
                if hash_str.contains_key(&schars[y]){
                    break;
                } else {
                    hash_str.insert(schars[y], y); 
                }
            }
            let cur_len = hash_str.len() as i32; 
            if cur_len > longest {
                longest = cur_len;
            }
            hash_str.clear();
        }
        longest
    }
}

Surprisingly, the String.find() version is 3 times faster than the HashMap in the benchmarks, in spite of the fact that I am using the same algorithm (or at least I think so). Intuitively, I would have assumed that doing the lookups in a hashmap should be considerably faster than searching the string's characters, but it turned out to be the opposite.

Can someone explain why the HashMap is so much slower? (or point out what I am doing wrong).

CodePudding user response：

When it comes to performance, one test is always better then 10 reasons.

use std::hash::{Hash, Hasher};

fn main() {
    let start = std::time::SystemTime::now();
    let mut hasher = std::collections::hash_map::DefaultHasher::new();

    let s = "a";
    let string = "ab";
    for i in 0..100000000 {
         s.hash(&mut hasher);
         let hash = hasher.finish();
    }

    eprintln!("{}", start.elapsed().unwrap().as_millis());
}

I use debug build so that compiler would not optimize out most of my code.

On my machine taking 100M hashes above takes 14s. If I replace DefaultHasher with SipHasher as some comments suggested, it takes 17s.

Now, variant with string:

use std::hash::{Hash, Hasher};

fn main() {
    let start = std::time::SystemTime::now();

    let string = "abcde";
    for i in 0..100000000 {
        for c in string.chars() {
            // do nothing
        }
    }

    eprintln!("{}", start.elapsed().unwrap().as_millis());
}

Executing this code with 5 chars in the string takes 24s. If there are 2 chars, it takes 12s.

Now, how does it answer your question?..

To insert a value into a hashset, a hash must be calculated. Then every time you want to check if a character is in the hashset, you need to calculate a hash again. Also there is some small overhead for checking if the value is in the hashset over just calculating the hash.

As we can see from the tests, calculating one hash of a single character string takes around the same time as iterating over 3 symbol string. So let's say you have a unique_str with value abcde, and you check if there is a x character in it. Just checking it would be faster with HashSet, but then you also need to add x into the set, which makes it taking 2 hashes against iterating 5-symbol string.

So as long as on average your unique_str is shorter than 5 symbols, string realization is guaranteed to be faster. And in case of an input string like aaaaaaaaa...., it will be ~6 times faster, then the HashSet option.

Of course this analisys is very simplistic and there can be many other factors in play (like compiler optimizations and specific realization of Hash and Find for strings), but it gives the idea, why in some cases HashSet can be slower then string.find().

Side note: in your code you use HashMap instead of HashSet, which adds even more overhead and is not needed in your case...