I have a problem to find the fastest way to check if a substring is in a string as an entire word or term. Currently, I'm using RegEx, but I need to perform thousands of verifications and RegEx is being VERY slow.
There are many ways to respond to this. The easier way to verify is substring in string
:
substring = "programming"
string = "Python is a high-level programming language"
substring in string
>>> True
In other hand, it's a naivy solution when we need to find the substring as an entire word or term:
substring = "program"
string = "Python is a high-level programming language"
substring in string
>>> True
Another solution is to split the string into a list of words and verify if the substring is in that list:
substring = "program"
string = "Python is a high-level programming language"
substring in string.split()
>>> False
Nevertheless, it doesn't work if the substring is a term. To resolve this, another solution would be to use RegEx:
import re
substring = "high-level program"
string = "Python is a high-level programming language"
re.search(r"\b{}\b".format(substring), string) != None
>>> False
However, my biggest problem is that the solution is REALLY slow if you need to perform thousands of verifications.
To mitigate this issue, I created some approaches that, although they are faster than RegEx (for the use I need), still are a lot slower than substring in string
:
substring = "high-level program"
string = "Python is a high-level programming language"
all([word in string.split() for word in substring.split()])
>>> False
Although simple, the above approach didn't fit because it ignores substring word order, returning True
if the substring was "programming high-level"
, unlike the solution in RegEx. So, I created another approach verifying if the substring is in a ngram list where each ngram has the same number of words as the substring:
from nltk import ngrams
substring = "high-level program"
string = "Python is a high-level programming language"
ngram = list(ngrams(string.split(), len(substring.split())))
substring in [" ".join(tuples) for tuples in ngram]
>>> False
Someone knows some a faster approach to find a substring inside a string as an entire word or term?
CodePudding user response:
Simply loop through the string and splice the string according to the substring length and compare the splice string with the substring if it is equal return True.
Illustration*
strs = "Coding"
substr = "ding"
slen = 4
i = 0
check = strs[i:slen i]==substr
# 1st iteration
strs[0:4 0] == ding
codi == ding # False
# 2nd iteration
i=1
strs[1:4 1] == ding
odin == ding # False
# 3rd iteration
i=2
strs [2:4 2] == ding
ding == ding # True
Solution
def str_exist(string, substring, slen):
for i in range(len(string)):
if string[i:slen i] == substring:
return True
return False
substring = "high-level program"
string = "Python is a high-level programming language"
slen = len(substring)
print(str_exist(string, substring, slen))
OUTPUT
True
CodePudding user response:
Check this out. I've added comments in my code for better understanding of what this algorithm is doing.
def check_substr(S: str, sub_str: str) -> bool:
"""
This function tells whether the given sub-string
in a string is present or not.
Parameters
S: str: The original string
sub_str: str: The sub-string to be checked
Returns
result: boolean: Whether the string is present or not
"""
i = 0
pointer = 0
while (i < len(S)):
# This means that we are already in that word
# whose sub-part is already matched. For eg:
# `program` in `programming`. Therefore we are
# going to skip the rest of the word and check
# the next word instead.
if (S[i] != ' ' and pointer == len(sub_str)):
while (S[i] != ' '):
i = 1
i = 1
pointer = 0
# If we encounter a space, we check whether we
# have already found the sub-string or not.
elif (S[i] == ' ' and pointer == len(sub_str)):
return True
if (S[i] == sub_str[pointer]):
pointer = 1
else:
# If the current element of the original
# string matched with the first element of
# the sub-string then we increment the
# pointer by 1. Otherwise we set it to 0.
pointer = 1 if (S[i] == sub_str[0]) else 0
i = 1
return pointer == len(sub_str)
S = "Python is a high-level programming language"
print(check_substr(S, "program"))
print(check_substr(S, "programming language"))
Output
False
True
Time Complexity
O(n)
Edits:
As @PGHE pointed out in the comments, we can also do the checking in punctuation characters and not only in spaces. Since the OP hasn't mentioned anything about the punctuation, I'm keeping this answer as it is.