Can you really divide identical strings into groups efficiently through hashes?


I am reading this article about string hashing. In the "Search for duplicate strings in an array of strings" section, it claims that you can group identical strings with a time complexity of O(M*N*log(N)) by using string hashing.

Looking at this code example presented in the article

vector<vector<int>> group_identical_strings(vector<string> const& s) {
    int n = s.size();
    vector<pair<long long, int>> hashes(n);
    for (int i = 0; i < n; i++)
        hashes[i] = {compute_hash(s[i]), i};

    sort(hashes.begin(), hashes.end());

    vector<vector<int>> groups;
    for (int i = 0; i < n; i++) {
        if (i == 0 || hashes[i].first != hashes[i-1].first)
            groups.emplace_back();
        groups.back().push_back(hashes[i].second);
    }
    return groups;
}

I am very confused as to how this code can be correct, because it creates a new group only when hashes[i].first != hashes[i-1].first. Two strings can be different while having equal hashes, so two different strings could end up in the same group. This condition doesn't look sufficient to me.

Am I wrong? Is this code correct? Why?

If not, is this algorithm, or at least this complexity, really achievable?

CodePudding user response:

You are right that two different strings can have equal hashes; this is called a hash collision. However, it comes down to which hash function you use. There are hash functions for which a collision is so unlikely that you can use this algorithm without fear of it breaking. In cryptography, we rely on exactly this property of cryptographically secure hash functions (see e.g. here).
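
For reference, the compute_hash in that snippet is the article's polynomial rolling hash. A minimal sketch of such a hash (assuming lowercase Latin letters; the constants p = 31 and m = 1e9 + 9 follow the article's usual choices) could look like this:

#include <string>
using namespace std;

// Polynomial rolling hash: interpret the string as a number in base p, modulo m.
// p = 31 suits lowercase Latin letters; m = 1e9 + 9 is a large prime modulus.
long long compute_hash(string const& s) {
    const long long p = 31;
    const long long m = 1e9 + 9;
    long long hash_value = 0;
    long long p_pow = 1;
    for (char c : s) {
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}

With a single modulus around 10^9, the chance that some pair among n strings collides is roughly n^2 / (2m) (the birthday bound), which is why it is common to combine two such hashes with different moduli when n gets large.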

In fact, the very source you mention states the following:

That's the important part that you have to keep in mind. Using hashing will not be 100% deterministically correct, because two complete different strings might have the same hash (the hashes collide). However, in a wide majority of tasks, this can be safely ignored as the probability of the hashes of two different strings colliding is still very small. And we will discuss some techniques in this article how to keep the probability of collisions very low.

Thus the algorithm is, just as you say, not guaranteed to be correct. But with the right choice of hash function, the probability of it breaking in practice is negligible.
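
If you want the grouping to be deterministically correct regardless of collisions, a standard fix is to treat an equal hash only as a candidate match and fall back to comparing the strings themselves on ties. A hedged sketch along those lines (group_identical_strings_exact is a made-up name, and it reuses the compute_hash sketch above):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>
using namespace std;

vector<vector<int>> group_identical_strings_exact(vector<string> const& s) {
    int n = s.size();
    vector<pair<long long, int>> hashes(n);
    for (int i = 0; i < n; i++)
        hashes[i] = {compute_hash(s[i]), i};

    // Sort by hash, breaking ties by the strings themselves, so that identical
    // strings end up adjacent even if unrelated strings happen to collide.
    sort(hashes.begin(), hashes.end(),
         [&](pair<long long, int> const& a, pair<long long, int> const& b) {
             if (a.first != b.first) return a.first < b.first;
             return s[a.second] < s[b.second];
         });

    vector<vector<int>> groups;
    for (int i = 0; i < n; i++) {
        // Open a new group when either the hash or the actual string changes.
        if (i == 0 || hashes[i].first != hashes[i-1].first
                   || s[hashes[i].second] != s[hashes[i-1].second])
            groups.emplace_back();
        groups.back().push_back(hashes[i].second);
    }
    return groups;
}

With a good hash the tie-breaking string comparisons almost never run, so the expected cost is still dominated by hashing and sorting; only when hashes collide do you pay for full string comparisons, and even in the worst case the cost stays within O(M*N*log(N)).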
