Check if a string matches the beginning of a regex

Time:12-03

I have many strings to match against a regex. Many of the strings start with the same substring. To speed up my search, I would like to check whether the regex could match a string which begins with the common substring...

Example

I have a regex, for instance /^(.[3e]|[o0]+)l+$/, and many strings, for instance these:

...
goo
goober
good
goodhearted
goodly
goods
goody
goof
goofball
google
goon
goose
...
held
helical
helices
helicopter
helipad
heliport
hell
help
hellion
helm
helmet
...

Half of the strings start with goo: I'd like to test whether goo is a valid beginning for a match. It's not (no string starting with goo can ever match that regex), thus I'd discard all those words at once.

The other half start with hel: I'd like to test whether hel is a valid beginning for a match. It is (some strings starting with hel may match that regex), thus I proceed testing those strings.
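
To make "valid beginning" concrete, here is a minimal hand-derived sketch in Python. It assumes the pattern reads ^(.[3e]|[o0]+)l+$; the names PREFIX_RE and can_begin_match are hypothetical, and the prefix pattern only works for this one regex:

```python
import re

# Hand-derived for the assumed pattern ^(.[3e]|[o0]+)l+$ only.
# A string is a prefix of some match if it is:
#   - empty or a single character, or
#   - ".[3e]" followed by any number of "l", or
#   - one or more of "o"/"0" followed by any number of "l".
PREFIX_RE = re.compile(r"(?:.?|.[3e]l*|[o0]+l*)")


def can_begin_match(prefix: str) -> bool:
    """Return True if some string starting with `prefix` can match the regex."""
    return PREFIX_RE.fullmatch(prefix) is not None


for candidate in ("goo", "hel"):
    print(candidate, can_begin_match(candidate))
# goo False
# hel True
```

This is exactly the kind of manual re-engineering that does not scale to arbitrary patterns.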

Is there any function to do this with a generic regex, without having to manually re-engineer it?

CodePudding user response:

With the data set given, filtering on the first 3 characters didn't speed up the processing. The overhead of shuffling the data is likely not worth it.

const test = `...
goo
goober
good
goodhearted
goodly
goods
goody
goof
goofball
google
goon
goose
...
held
helical
helices
helicopter
helipad
heliport
hell
help
hellion
helm
helmet
...`;

const re1 = /^(.[3e]|[o0]+)l+$/;

const t0 = performance.now();

// split the test string into an array
let arr1 = test.split('\n');

// create a set to hold the first three letters of each string in the array
const firstThree = new Set();
arr1.forEach(e => {
  firstThree.add(e.substring(0, 3));
});

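// loop through the set
for (const ft of firstThree) {
  // check whether the three-character prefix is itself a full match
  // (this is stricter than "could begin a match": a prefix such as "ooo"
  // fails here even though "oool" would match)
  if (!re1.test(ft)) {
    // if not, remove the strings starting with that prefix from the array
    arr1 = arr1.filter(e => e.indexOf(ft) !== 0);
  }
}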

arr1 = arr1.filter(e => re1.test(e));
const t1 = performance.now();
console.log(`Took ${t1 - t0} milliseconds.`);

console.log(arr1);

const re2 = /^(?:.[3e]|[o0]+)l+$/mg;

const t2 = performance.now();
const arr2 = [...test.matchAll(re2)];
const t3 = performance.now();
console.log(`Took ${t3 - t2} milliseconds.`);

console.log(arr2[0]);

CodePudding user response:

In Python, you can use the re.match() function in the re module to check whether a regex matches at the start of a string. Applied to the common substring, it tells you whether that substring is itself a match. It does not tell you whether a longer string starting with it could match, so it only works as a prefix filter when the common substring is already a complete match (as "hel" is here).

Here is an example that applies it to the common substring:

import re

# The regex that you want to test
regex = r"^(.[3e]|[o0]+)l+$"

# The common substring that you want to check
common_substring = "goo"

# Check whether the regex matches the common substring at the start of the string
if re.match(regex, common_substring):
    # The regex matches the common substring at the start of the string
    # Proceed to test the strings that start with the common substring
    pass  # ...
else:
    # The regex does not match the common substring at the start of the string
    # Discard all the strings that start with the common substring
    pass  # ...
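
One caveat is worth checking on the snippet above: re.match() accepts a prefix only when the prefix is already a complete match. A minimal check, assuming the pattern reads ^(.[3e]|[o0]+)l+$ (the results dictionary is only there for illustration):

```python
import re

# Assumed pattern, with the + quantifiers written out.
regex = r"^(.[3e]|[o0]+)l+$"

# "hel" is itself a complete match, so re.match() accepts it.
# "he" is a valid beginning ("hel" and "hell" both match), yet re.match()
# rejects it, because the pattern still requires at least one "l".
# "goo" can never begin a match and is rejected as well.
results = {s: bool(re.match(regex, s)) for s in ("hel", "he", "goo")}
print(results)
# {'hel': True, 'he': False, 'goo': False}
```

So a False result from re.match() on a prefix is not enough, on its own, to discard every string that starts with that prefix.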

Note that re.match() only tries to match at the start of the string, whereas re.search() scans the whole string for a match. For example, with the regex r"\d+", re.match() finds a match in "12345" but not in "abc12345", while re.search() finds one in both. Because the regex here begins with ^, the two functions behave the same, and neither tests whether a longer string could match. Here is an example that uses re.search() for the check on the common substring:

import re

# The regex to match against
regex = r"^(.[3e]|[o0]+)l+$"

# The common substring to check
common = "hel"

# Check whether the common substring itself matches the regex
if re.search(regex, common):
    # If it does, proceed to test each individual string
    strings = ["held", "helical", "helices", "helicopter", "helipad", "heliport", "hell", "help", "hellion", "helm", "helmet"]
    for string in strings:
        # Check if each string matches the regex
        if re.match(regex, string):
            print(f"'{string}' matches the regex")
else:
    # If the common substring does not match the regex, we discard all the strings at once
    print(f"The common substring '{common}' does not match the regex, so none of the strings can match it either")
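
For a generic pattern, one option that avoids hand-rewriting the regex is partial matching in the third-party regex package (not the standard re module), which accepts a partial=True argument on fullmatch(). The following is a minimal sketch, assuming that package is installed and that the pattern reads (.[3e]|[o0]+)l+; the function name could_match_with_more_input is hypothetical:

```python
try:
    import regex  # third-party package (pip install regex), not the stdlib re module
except ImportError:
    regex = None

# Assumed pattern; fullmatch() supplies the ^...$ anchoring.
PATTERN = r"(.[3e]|[o0]+)l+"


def could_match_with_more_input(prefix):
    """True if `prefix` is a complete match or could be extended into one."""
    if regex is None:
        raise RuntimeError("this sketch needs the third-party 'regex' package")
    # With partial=True, fullmatch() also returns a match object when the
    # end of the string is reached while the pattern could still continue.
    return regex.fullmatch(PATTERN, prefix, partial=True) is not None


if regex is not None:
    for prefix in ("goo", "hel", "he"):
        print(prefix, could_match_with_more_input(prefix))
```

The returned match object also carries a partial attribute, which distinguishes a complete match from one that merely could continue.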