Checking if two strings are equal after removing a subset of characters from both


I recently came across this problem:

You are given two strings, s1 and s2, comprised entirely of lowercase letters 'a' through 'r', and need to process a series of queries. Each query provides a subset of lowercase English letters from 'a' through 'r'. For each query, determine whether s1 and s2, when restricted only to the letters in the query, are equal. s1 and s2 can contain up to 10^5 characters, and there are up to 10^5 queries.

For instance, if s1 is "aabcd" and s2 is "caabd", and you are asked to process a query with the subset "ac", then s1 becomes "aac" while s2 becomes "caa". These don't match, so the query would return false.

I was able to solve this in O(N^2) time by doing the following: For each query, I checked whether s1 and s2 would be equal by iterating through both strings, one character at a time, skipping the characters that do not lie within the subset of allowed characters, and checking whether the "allowed" characters from both s1 and s2 match. If at some point the characters don't match, then the strings are not equal. Otherwise, s1 and s2 are equal when restricted to the letters in the query. Each query takes O(N) time to process, and there are N queries, for a total of O(N^2) time.

However, I was told that there was a way to solve this faster than O(N^2), using bitwise operators and a bit of clever math. Does anyone know how this might be done?

CodePudding user response:

The first obvious speedup is to ensure your set membership test is O(1). To do that, there are a couple of options:

  • Represent every letter as a single bit -- now every character is an 18-bit value with only one bit set. The set of allowed characters is now a mask with these bits ORed together and you can test membership of a character with a bitwise-AND;
  • Alternatively, you can have an 18-value array and index it by character (c - 'a' would give a value between 0 and 17). The test for membership is then basically the cost of an array lookup (and you can save operations by not doing the subtraction -- instead just make the array larger and index directly by character).
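As a rough sketch, the bitmask version could look like this in C++ (with 'a' mapped to bit 0 and 'r' to bit 17; the helper names are only illustrative):

    #include <cstdint>
    #include <string>

    // Build an 18-bit mask from the query's allowed letters ('a'..'r').
    uint32_t make_mask(const std::string& allowed_letters) {
        uint32_t mask = 0;
        for (char c : allowed_letters) mask |= 1u << (c - 'a');
        return mask;
    }

    // O(1) membership test: is character c in the allowed set?
    bool is_allowed(char c, uint32_t mask) {
        return (mask >> (c - 'a')) & 1u;
    }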

Thought experiment

The next potential speedup is to recognize that any character which does not appear exactly the same number of times in both strings instantly fails any query that includes it. You can count all character frequencies in both strings with a histogram, which can be done in O(N) time. In this way, you can prune the search space whenever such a character appears in the query, and you can test for this in constant time.
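A minimal sketch of that precheck, assuming only letters 'a' through 'r' (helper names are mine):

    #include <array>
    #include <cstdint>
    #include <string>

    constexpr int ALPHA = 18;  // letters 'a' through 'r'

    // Letters whose total counts differ between s1 and s2. Any query containing
    // one of these letters can be rejected in O(1) with a bitwise-AND against this mask.
    uint32_t mismatched_letters(const std::string& s1, const std::string& s2) {
        std::array<int, ALPHA> diff{};
        for (char c : s1) ++diff[c - 'a'];
        for (char c : s2) --diff[c - 'a'];
        uint32_t bad = 0;
        for (int i = 0; i < ALPHA; ++i)
            if (diff[i] != 0) bad |= 1u << i;
        return bad;
    }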

Of course, that won't help against a real stress test, which will guarantee that every letter's total frequency matches in both strings. So, what do you do then?

Well, you extend the above premise by recognizing that for any position of character x in string 1 and the position of that character in string 2 that would form a valid match (i.e. the same number of occurrences of x appear in both strings up to those respective positions), the total count of any other character up to those positions must also be equal. Any character for which that is not true cannot possibly be compatible with character x.


Concept

Let's start by thinking about this in terms of a technique known as memoization where you can leverage precomputed or partially-computed information and get a whole lot out of it. So consider two strings like this:

a  b  a  a  c  d  e   |   e  a  b  a  c  a  d 

What useful thing can you do here? Well, why not store the partial sums of counts for each letter:

          a  b  a  a  c  d  e  |  e  a  b  a  c  a  d
         ----------------------|----------------------
freq(a) = 1  1  2  3  3  3  3  |  0  1  1  2  2  3  3
freq(b) = 0  1  1  1  1  1  1  |  0  0  1  1  1  1  1
freq(c) = 0  0  0  0  1  1  1  |  0  0  0  0  1  1  1
freq(d) = 0  0  0  0  0  1  1  |  0  0  0  0  0  0  1
freq(e) = 0  0  0  0  0  0  1  |  1  1  1  1  1  1  1

This uses a whole lot of memory, but don't worry -- we'll deal with that later. Instead, take the time to absorb what we're actually doing.

Looking at the table above, we have the running character count totals for both strings at every position in those strings.
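Purely as an illustration, that running-count table could be built like this (the Dynamic Programming section below shows you don't actually need to store it):

    #include <array>
    #include <string>
    #include <vector>

    constexpr int ALPHA = 18;

    // freq[p][i] = number of occurrences of letter ('a' + i) in s[0..p].
    std::vector<std::array<int, ALPHA>> prefix_counts(const std::string& s) {
        std::vector<std::array<int, ALPHA>> freq(s.size());
        std::array<int, ALPHA> running{};
        for (size_t p = 0; p < s.size(); ++p) {
            ++running[s[p] - 'a'];
            freq[p] = running;
        }
        return freq;
    }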

Now let's see how our matching rules work by showing an example of a matching query "ab" and a non-matching query "acd":

  1. For "ab":

               a  b  a  a  c  d  e  |  e  a  b  a  c  a  d
              ----------------------|----------------------
     freq(a) = 1  1  2  3  3  3  3  |  0  1  1  2  2  3  3
     freq(b) = 0  1  1  1  1  1  1  |  0  0  1  1  1  1  1
               ^  ^  ^  ^                 ^  ^  ^     ^
    

     We scan along the strings, stopping only at positions whose character is one of the query's letters -- the columns I have marked with ^ above. If you remove all the unmarked columns, you'll see the remaining columns match on both sides. So this is a match.

  2. For "acd":

               a  b  a  a  c  d  e  |  e  a  b  a  c  a  d
              ----------------------|----------------------
     freq(a) = 1  1  2  3  3  3  3  |  0  1  1  2  2  3  3
     freq(c) = 0  0  0  0  1  1  1  |  0  0  0  0  1  1  1
     freq(d) = 0  0  0  0  0  1  1  |  0  0  0  0  0  0  1
               ^     ^  #  #  ^           ^     ^  #  #  ^
    

     Here, the marked columns do not all match -- the mismatching ones are flagged with #. So this query is not a match.


Putting it together

All right, so you can see how this works, but you may be wondering about the runtime, because those examples above seem to be doing even more scanning than you were doing before!

Here's where things get interesting, and where our character frequency counts really kick in.

First, observe what we're actually doing on those marked columns. For any one character that we care about (for example, the 'a'), we're looking for only the positions in both strings where its count matches, and then we're comparing these two columns to see what other values match. This gives us a set of all other characters that are valid when used with 'a'. And of course, 'a' itself is also valid.

And what was our very first optimization? A bitset -- an 18-bit value that represents valid characters. You can now use this. For the two columns (one in each string), you set a 1 for characters with matching counts and a 0 for characters with non-matching counts. If you process every pair of matching 'a' positions in this manner, what you get is a collection of sets of characters that work alongside 'a'. And you can keep a running "master" set that represents the intersection of all of them -- you just intersect it with each intermediate set you calculate, which is a single bitwise-AND.
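In code, folding one pair of matching columns into the running master set might look like this (col1, col2, and master are hypothetical names for the two count columns and the running intersection):

    #include <array>
    #include <cstdint>

    // Given the running counts of both strings at a pair of matching positions,
    // keep only the letters whose counts agree, with a single bitwise-AND.
    uint32_t fold_columns(const std::array<int, 18>& col1,
                          const std::array<int, 18>& col2,
                          uint32_t master) {
        uint32_t good = 0;
        for (int i = 0; i < 18; ++i)
            if (col1[i] == col2[i]) good |= 1u << i;
        return master & good;
    }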

By the time you reach the end of both strings, you have performed an O(N) scan, examining 18 rows each time you encountered an 'a'. And the result is the set of all characters that work with 'a'.

Now repeat for the other characters, one at a time. Each one is an O(N) scan as above, and you wind up with the set of all other characters that work with the one you're processing.

After processing all these rows you now have an array of 18 values, one per character, each representing the set of all characters that work with it. The operation took O(18N) time and O(18N) storage.


Query

Since you now have an array where, for each character, you have all possible characters that work with it, you simply look up each character in the query and intersect their sets (again, with bitwise-AND). You also intersect with the set of characters present in the query itself, which prunes off all characters that are not relevant.

This leaves you with a value which represents the set of all characters that, taken together, can result in a matching string. If this value is equal to the query set, then you have a match. Otherwise, you don't.
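A sketch of a single query test, assuming mask[c] holds the 18-bit set of letters compatible with letter c (how to compute it is shown below):

    #include <array>
    #include <cstdint>
    #include <string>

    bool answer_query(const std::string& query, const std::array<uint32_t, 18>& mask) {
        uint32_t q = 0;
        for (char c : query) q |= 1u << (c - 'a');     // set of letters in the query
        uint32_t result = q;                           // prune to the query's own letters
        for (char c : query) result &= mask[c - 'a'];  // intersect the compatibility sets
        return result == q;                            // match iff every query letter survives
    }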

This is the part that is now fast. It has essentially reduced your query tests to constant time. However, the original indexing did cost us a lot of memory. What can be done about that?


Dynamic Programming

Is it really necessary to allocate all that storage for our frequency arrays?

Well, actually, no. Laying the counts out in tabular form was useful as a visual aid to explain the method conceptually, but most of those values are never needed, and storing them all makes the method look more complicated than it is.

The good news is we can compute our master sets at the same time as computing the character counts, without needing to store any frequency arrays. Remember that when we're computing the counts we use a histogram, which is as simple as having one small 18-value array where you say count[c]++ (where c is a character, or an index derived from that character). So we could save a ton of memory if we just do the following:

Processing the set (mask) of all compatible characters for character c:

  1. Initialize the mask for character c to all 1s (mask[c] = (1 << 18) - 1) -- this says that all characters are currently considered compatible. Initialize a character histogram (count) to all zeros.

  2. Walk through string 1 until you reach the next occurrence of character c. For every character x you encounter along the way, increase its count in the histogram (count[x]++).

  3. Walk through string 2 until you reach the next occurrence of character c. For every character x you encounter along the way, decrease its count in the histogram (count[x]--).

  4. Construct a 'good' set where any character that currently has a zero-count has a 1-bit, otherwise 0-bit. Intersect this with the current mask for c (using bitwise-AND): mask[c] &= good

  5. Continue from step 2 until you have reached the end of both strings. If you reach the end of one of the strings prematurely, then the character count does not match and so you set the mask for this character to zero: mask[c] = 0

  6. Repeat from 1 for every character, until all characters are done.

Above, we have basically the same time complexity of O(18N), except with absolutely minimal extra storage -- only a single 18-entry histogram of counts (reset for each character) and one array of 18 masks.
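As a rough sketch, steps 1 through 6 could be coded like this (the names here are only illustrative; a full version is linked below):

    #include <array>
    #include <cstdint>
    #include <string>

    constexpr int ALPHA = 18;
    constexpr uint32_t ALL = (1u << ALPHA) - 1;

    std::array<uint32_t, ALPHA> build_masks(const std::string& s1, const std::string& s2) {
        std::array<uint32_t, ALPHA> mask;
        for (int c = 0; c < ALPHA; ++c) {
            mask[c] = ALL;                       // step 1: all characters start compatible
            std::array<int, ALPHA> count{};      // running count difference (s1 minus s2)
            size_t i = 0, j = 0;
            while (true) {
                // Step 2: advance in s1 up to and including the next occurrence of c.
                bool found1 = false;
                while (i < s1.size()) {
                    int x = s1[i++] - 'a';
                    ++count[x];
                    if (x == c) { found1 = true; break; }
                }
                // Step 3: advance in s2 up to and including the next occurrence of c.
                bool found2 = false;
                while (j < s2.size()) {
                    int x = s2[j++] - 'a';
                    --count[x];
                    if (x == c) { found2 = true; break; }
                }
                if (found1 != found2) { mask[c] = 0; break; }  // step 5: premature end
                if (!found1) break;                            // both strings exhausted
                // Step 4: letters whose counts currently cancel out stay compatible.
                uint32_t good = 0;
                for (int x = 0; x < ALPHA; ++x)
                    if (count[x] == 0) good |= 1u << x;
                mask[c] &= good;
            }
        }
        return mask;
    }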

Combining techniques like the above to solve seemingly complex combinatorial problems really fast is commonly referred to as Dynamic Programming. We reduced the problem down to a truth table representing all possible characters that work with any single character. The time complexity remains linear with respect to the length of the strings, and only scales by the number of possible characters.

Here is the algorithm above hacked together in C++: https://godbolt.org/z/PxzYvGs8q

CodePudding user response:

Let RESTRICT(s,q) be the restriction of string s to the letters in the set q.

If q contains more than two letters, then the full string RESTRICT(s,q) can be reconstructed from all the strings RESTRICT(s,qij) where qij is a pair of letters in q.

Therefore, RESTRICT(s1,q) = RESTRICT(s2,q) if and only if RESTRICT(s1,qij) = RESTRICT(s2,qij) for all pairs qij in q.

Since you are restricted to 18 letters, there are only 153 letter pairs, or only 171 fundamental single- or double-letter queries.

If you precalculate the results of these 171 queries, then you can answer any other more complicated query just by combining their results, without inspecting the string at all.
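As a sketch of this approach (with illustrative helper names), the 171 answers can be precomputed once in O(171 * N) time, after which each query needs only pair lookups:

    #include <array>
    #include <cstdint>
    #include <string>

    constexpr int ALPHA = 18;

    // RESTRICT(s, q): keep only the letters whose bits are set in q.
    static std::string restrict_to(const std::string& s, uint32_t q) {
        std::string out;
        for (char c : s)
            if ((q >> (c - 'a')) & 1u) out.push_back(c);
        return out;
    }

    struct PairTable {
        std::array<std::array<bool, ALPHA>, ALPHA> ok;  // ok[i][j]: restriction to {i, j} matches

        PairTable(const std::string& s1, const std::string& s2) {
            for (int i = 0; i < ALPHA; ++i)
                for (int j = i; j < ALPHA; ++j) {
                    uint32_t q = (1u << i) | (1u << j);            // single letter when i == j
                    ok[i][j] = ok[j][i] =
                        restrict_to(s1, q) == restrict_to(s2, q);  // one of the 171 base queries
                }
        }

        // A larger query matches iff every single- and double-letter subquery matches.
        bool query(const std::string& letters) const {
            for (size_t a = 0; a < letters.size(); ++a)
                for (size_t b = a; b < letters.size(); ++b)
                    if (!ok[letters[a] - 'a'][letters[b] - 'a']) return false;
            return true;
        }
    };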
