What is the fastest way to check if a substring is in a string as an entire word or term, like RegEx-CodePudding

I have a problem to find the fastest way to check if a substring is in a string as an entire word or term. Currently, I'm using RegEx, but I need to perform thousands of verifications and RegEx is being VERY slow.

There are many ways to respond to this. The easier way to verify is substring in string:

substring = "programming"
string = "Python is a high-level programming language"

substring in string

>>> True

In other hand, it's a naivy solution when we need to find the substring as an entire word or term:

substring = "program"
string = "Python is a high-level programming language"

substring in string

>>> True

Another solution is to split the string into a list of words and verify if the substring is in that list:

substring = "program"
string = "Python is a high-level programming language"

substring in string.split()

>>> False

Nevertheless, it doesn't work if the substring is a term. To resolve this, another solution would be to use RegEx:

import re

substring = "high-level program"
string = "Python is a high-level programming language"

re.search(r"\b{}\b".format(substring), string) != None

>>> False

However, my biggest problem is that the solution is REALLY slow if you need to perform thousands of verifications.

To mitigate this issue, I created some approaches that, although they are faster than RegEx (for the use I need), still are a lot slower than substring in string:

substring = "high-level program"
string = "Python is a high-level programming language"

all([word in string.split() for word in substring.split()])

>>> False

Although simple, the above approach didn't fit because it ignores substring word order, returning True if the substring was "programming high-level", unlike the solution in RegEx. So, I created another approach verifying if the substring is in a ngram list where each ngram has the same number of words as the substring:

from nltk import ngrams

substring = "high-level program"
string = "Python is a high-level programming language"

ngram = list(ngrams(string.split(), len(substring.split())))

substring in [" ".join(tuples) for tuples in ngram]

>>> False

Someone knows some a faster approach to find a substring inside a string as an entire word or term?

CodePudding user response：

Simply loop through the string and splice the string according to the substring length and compare the splice string with the substring if it is equal return True.

Illustration*

strs = "Coding"
substr = "ding"
slen = 4
i = 0

check = strs[i:slen i]==substr

# 1st iteration
strs[0:4 0] == ding
codi == ding # False

# 2nd iteration
i=1
strs[1:4 1] == ding
odin == ding # False

# 3rd iteration
i=2
strs [2:4 2] == ding
ding == ding # True

Solution

def str_exist(string, substring, slen):
    for i in range(len(string)):
        if string[i:slen i] == substring:
             return True
    return False

substring = "high-level program"
string = "Python is a high-level programming language"
slen = len(substring)

print(str_exist(string, substring, slen))

OUTPUT

True

CodePudding user response：

Check this out. I've added comments in my code for better understanding of what this algorithm is doing.

def check_substr(S: str, sub_str: str) -> bool:
  """
  This function tells whether the given sub-string 
  in a string is present or not.
  
  Parameters
  S:       str: The original string
  sub_str: str: The sub-string to be checked
  
  Returns
  result: boolean: Whether the string is present or not
  """
  i = 0
  pointer = 0
  
  while (i < len(S)):
    # This means that we are already in that word
    # whose sub-part is already matched. For eg:
    # `program` in `programming`. Therefore we are
    # going to skip the rest of the word and check
    # the next word instead.
    if (S[i] != ' ' and pointer == len(sub_str)):
      while (S[i] != ' '):
        i  = 1
      i  = 1
      pointer = 0
    
    # If we encounter a space, we check whether we
    # have already found the sub-string or not.
    elif (S[i] == ' ' and pointer == len(sub_str)):
      return True
      
    if (S[i] == sub_str[pointer]):
      pointer  = 1
      
    else:
      # If the current element of the original 
      # string matched with the first element of
      # the sub-string then we increment the 
      # pointer by 1. Otherwise we set it to 0.
      pointer = 1 if (S[i] == sub_str[0]) else 0
      
    i  = 1
  
  return pointer == len(sub_str)
  
S = "Python is a high-level programming language"
print(check_substr(S, "program"))
print(check_substr(S, "programming language"))

Output

False
True

Time Complexity

O(n)

Edits:

As @PGHE pointed out in the comments, we can also do the checking in punctuation characters and not only in spaces. Since the OP hasn't mentioned anything about the punctuation, I'm keeping this answer as it is.