I am trying, but struggling, to write an algorithm that checks if a substring exists in a piece of text. The piece of text can include punctuation, but only alphanumeric characters should be considered when searching for the substring.
I would like to return the start and end index of the substring. It is guaranteed that the substring exists in the text. However, these indexes should also account for punctuation that was ignored in the search.
For example, for the text BD ACA;B_ 1E
and the substring AB1
, the algorithm should return 5
and 11
as the start and end index. (text[5:11]
-> A;B_ 1
== AB1
with punctuation removed.)
This is the best I have done so far.
def search(text, sub):
print(text, sub)
if not sub:
return True
for i, char in enumerate(text):
if char == sub[0]:
return search(text[1:], sub[1:])
else:
return search(text[1:], sub)
result = search("BD ACA;B_ 1E", "AB1")
print(result)
- I am printing out the two strings at the entry point to the function just to see how it progresses.
- If the first n chars of the substring are found, but not the rest, then the search doesn't restart to search for the entire substring again. This is what I was attempting to do with the for loop, but can't get it working. This means that it will return true if call characters of the substring simply exist somewhere, in order, in the text.
- I have just tried to initially get the search working successfully, I haven't even tried to track and return the start and end indexes yet.
CodePudding user response:
You can use a regular expression for that. [\W_]*
will match any non-alphanumeric character sequence, so you could alternate the letters of the search string with that pattern:
import re
def search(text, sub):
match = re.search(r"[\W_]*".join(sub), text)
return match and match.span(0)
text = "BD ACA;B_ 1E"
span = search(text, "AB1")
if span:
start, end = span
print(start, end)
print(text[start:end])
CodePudding user response:
There is a string function isalnum() that can check for alphanumeric chars.
t = 'BD ACA;B_ 1'
s = 'AB1'
def is_in_st(t, s):
start = t.rfind(s[0])
end = t.rfind(s[-1]) 1
if start and end:
return s == ''.join([c for c in t[start:end] if c.isalnum()])
is_in_st(t, s)
True