I am working on a coding problem in which I have to delete all occurrences of a substring T from a string S (keeping in mind that removing one occurrence of T from S may generate a new occurrence of T), and then return the resulting string S after all deletions. The sizes of both S and T can be up to 10^6.
For example, if I have S = "aabcbcd" and T = "abc", then removing all occurrences of "abc" in S results in S = "d".
The sample solution to this problem involves building a string R from S one character at a time; whenever the end of R matches T, we delete that match from the end of R (the comparison between the end of R and T is done via string hashing).
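For context, here is my own minimal reconstruction of that approach. It is only a sketch; the polynomial hash (base 131 mod 2^61 - 1) and the variable names are my choices, not necessarily the solution's:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

using u64 = std::uint64_t;

const u64 MOD = (1ull << 61) - 1;  // a large Mersenne prime
const u64 BASE = 131;              // arbitrary hash base

// Modular multiplication; uses the GCC/Clang __int128 extension.
u64 mulmod(u64 a, u64 b) { return (u64)((unsigned __int128)a * b % MOD); }

int main() {
    std::string S = "aabcbcd", T = "abc";

    // Hash of T: thsh = T[0]*BASE^(|T|-1) + ... + T[|T|-1] (mod MOD).
    u64 thsh = 0;
    for (char c : T) thsh = (mulmod(thsh, BASE) + (unsigned char)c) % MOD;

    // Powers of BASE, precomputed up to |S|.
    std::vector<u64> pw(S.size() + 1, 1);
    for (std::size_t i = 1; i <= S.size(); ++i) pw[i] = mulmod(pw[i - 1], BASE);

    std::string R;
    std::vector<u64> h{0};  // h[i] = hash of R[0..i-1]; truncated along with R
    for (char c : S) {
        R.push_back(c);
        h.push_back((mulmod(h.back(), BASE) + (unsigned char)c) % MOD);
        if (R.size() >= T.size()) {
            // Hash of the last |T| characters of R, from the prefix hashes.
            u64 hsh = (h.back() + MOD
                       - mulmod(h[R.size() - T.size()], pw[T.size()])) % MOD;
            // If the end of R and T match, truncate R and its hash array.
            if (hsh == thsh && R.substr(R.size() - T.size()) == T) {
                R.resize(R.size() - T.size());
                h.resize(h.size() - T.size());
            }
        }
    }
    std::cout << R << '\n';  // prints "d" for this example
}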
The solution says:
Since this deletion is at the end of R this is just a simple O(1) resize operation.
However, according to https://m.cplusplus.com/reference/string/string/resize/ the time complexity of string::resize is linear in the new string length. Ben Voigt confirms this in "Why is string::resize linear in complexity?".
Also, the solution's code uses string::substr to double-check that the end of R and T really match (since hash(end of R) == hash(T) does not guarantee that the end of R equals T):
/* If the end of R and T match, truncate the end of R (and associated hash arrays). */
if (hsh == thsh && R.substr(R.size() - T.size()) == T) {
    // ...
}
Once again, https://m.cplusplus.com/reference/string/string/substr/ says that string::substr has linear time complexity.
Even if string::substr weren't linear, comparing the two strings directly would still make the comparison linear in the size of T.
If this is true, wouldn't the time complexity of the solution be at least O(S.length()*T.length()), instead of the O(S.length()) claimed by the solution? Any help is appreciated!
CodePudding user response:
string::resize isn't always linear. If you're expanding a string, it's linear in the number of characters copied, which is potentially the total number in the resulting string (but could be less, if the string already has enough space for the characters you add, so it only has to write the new ones).
Using resize to reduce the size of a string will normally take constant time. In simplified form (and leaving out a lot of other "stuff"), string can look something like this:
#include <cstddef>

class string {
    char *data;
    std::size_t allocated_size;
    std::size_t in_use_size;
public:
    void resize(std::size_t new_size) {
        if (new_size < in_use_size) {
            // Shrinking: just adjust the bookkeeping and re-terminate -- O(1).
            in_use_size = new_size;
            data[in_use_size] = '\0';
        } else {
            // code to expand the string to the desired size in O(n) time
        }
    }
    // ...
};
So although it'll be linear when expanding the string, it'll typically have constant complexity when reducing the size.
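A quick way to see this in practice (typical behavior, not something the standard guarantees) is to check that a shrinking resize keeps the same buffer and capacity:

#include <iostream>
#include <string>

int main() {
    std::string s(1000000, 'x');
    const char *before = s.data();

    s.resize(10);  // shrink: just adjusts the size bookkeeping

    std::cout << std::boolalpha
              << "same buffer: " << (s.data() == before) << '\n'  // true
              << "capacity:    " << s.capacity() << '\n';         // still ~1000000
}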
As for using substr, yes, in the case where the hashes match, substr itself will be linear (it creates a new string object) and you're going to do a linear-complexity comparison. I'd guess they're pretty much just presuming hash collisions are rare enough to ignore, so for most practical purposes, this only happens when you have an actual match.
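Incidentally, if the temporary allocation from substr bothers you, the same check can be done in place with string::compare; the comparison is of course still O(T), but no new string object is created:

// Same check as in the question, without substr's temporary allocation:
if (hsh == thsh && R.compare(R.size() - T.size(), T.size(), T) == 0) {
    // ...
}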
CodePudding user response:
Since Jerry Coffin already gave you a really good answer to the rest of your question, I will take a stab at:
Even if string::substr weren't linear, comparing the two strings directly would still make the comparison linear in the size of T.
If this is true, wouldn't the time complexity of the solution be at least O(S.length()*T.length()), instead of the O(S.length()) claimed by the solution? Any help is appreciated!
The justification here is stated (per your link) as:
It's ok (and preferable) to do a full string comparison in case the hashes match to ensure that the strings really do match. Since they do with high probability we can amortize this against the length of R that we'll delete.
So what does it mean to "amortize" this part? Because you are right, the string comparison is linear in T, and if the algorithm did a comparison for each element of S, then the time complexity would be O(TS).
The best way I have of thinking about amortization is to ask the question: "How many times do we touch each element?"
Again, if there was a T-character string comparison for each element of S, then we would touch element i once for each element of T as T slides across the string S. Then there might also be a deletion of i at some point, but unless you can say something smart about that, it does not help much (if we know nothing, then either assume the worst case, T+1 touches = O(T), or take the average, (T+1)/2 touches = O(T)), which would mean that with S elements that are each touched O(T) times, the total time complexity = O(TS).
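For contrast, here is a hypothetical variant of the same scan that skips the hashes and does the full suffix comparison after every appended character; each step costs O(T), giving exactly the O(TS) bound described above:

#include <string>

// O(S*T) variant: compare the suffix of R against T on every step,
// instead of only when a hash match signals a probable occurrence.
std::string removeAllQuadratic(const std::string& S, const std::string& T) {
    std::string R;
    for (char c : S) {
        R.push_back(c);
        if (R.size() >= T.size() &&
            R.compare(R.size() - T.size(), T.size(), T) == 0)  // O(T) every time
            R.resize(R.size() - T.size());                     // delete the match
    }
    return R;
}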
But that is not the case here. Since we only do the comparison when there is a hash match, and since they claim that false matches are so rare that we effectively only get a hash match when there is an actual match (this claim may or may not be true; verifying it would require an investigation into the hash algorithm), the logic changes:
If we only do a string comparison when there is a match, and thus only just before deleting the characters we compare against, how many times do we now touch element i?
i is now touched once for a comparison, and then immediately after for a deletion, and then it is gone and never touched again. That makes it constant time per element, O(2) = O(1), and thus the total time complexity is linear in the number of elements: O(S).
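To put numbers on it, under the assumption that comparisons happen only on true matches: every character of S is (1) appended to R once, (2) compared at most once, because a compared character is deleted immediately afterwards, and (3) deleted at most once. So the total work is at most |S| + |S| + |S| = 3*|S| = O(S).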