How to make regular expression match a required text?

Time:12-08

I have written the following regular expression:

regex = r"car\S*\s*(\w+\s+){1,2}\s*\S*wash"

This regex matches texts such as the following (i.e., one or two words between "car" and "wash"):

"car. was good ?wash"
"car wash"
"car will never wash"

But I want the regex to also match these variations:

texts = [
"Car, not ... (?!) wash",   # should match: only one word between car and wash, plus any amount of punctuation
"Car never:)... $@@! with wash", # should match: only two words between car and wash, plus any amount of punctuation
"Car, was never wash",
"Car...:) things, not wash"]

But the regex I have written fails on these. How can I modify it so that it matches all of the texts above?

import re

# Define the regular expression
regex = r"car\S*\s*(\w+\s+){1,2}\s*\S*wash"

# Use the re.search() function to find a match
match = re.search(regex, "Car, not ... (?!) wash", flags=re.I)

# Check if a match was found
if match:
    print("Match found: ", match.group(0))
else:
    print("No match found")

In short, I have to match any text that starts with "car" and ends with "wash", under these conditions:

  • It can have only 1 to 2 words between "car" and "wash". The regex I wrote takes care of that part.
  • Along with those words, it can have any number of punctuation characters or spaces between them.

CodePudding user response:

Based on my interpretation of your question plus additional comments, I am defining the following rules.

  1. A word is a sequence of one or more non-whitespace characters, at least one of which must be a letter, digit or underscore (the latter as it is part of the character class \w).
  2. Words are separated by at least one whitespace character, mixed with zero or more 'punctuation' characters (i.e. anything but letter/digit/underscore).

There is some ambiguity in my definition; punctuation between a letter and a space could be considered part of the word (rule #1) or part of the separation between words (rule #2). But when counting words, that makes no difference.

From there, I can build two subpatterns.

  • \S*\w\S* - a word has at least one word character, and no whitespace
  • \W*\s\W* - a separator has at least one whitespace character, and no word character
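
As a quick sanity check, the two subpatterns can be tried on individual fragments with re.fullmatch. This is only a sketch: the names WORD, SEPARATOR, is_word and is_separator are labels I made up for illustration, and the sample fragments are taken from the texts in the question.

```python
import re

# Hypothetical names for the two subpatterns described above.
WORD = r"\S*\w\S*"        # at least one word character, no whitespace
SEPARATOR = r"\W*\s\W*"   # at least one whitespace character, no word character

def is_word(fragment):
    """Return True if the whole fragment counts as one word under rule 1."""
    return re.fullmatch(WORD, fragment) is not None

def is_separator(fragment):
    """Return True if the whole fragment counts as a separator under rule 2."""
    return re.fullmatch(SEPARATOR, fragment) is not None

print(is_word("never:)..."))        # True: contains letters, no whitespace
print(is_word("(?!)"))              # False: punctuation only
print(is_separator(" ... (?!) "))   # True: whitespace mixed with punctuation
print(is_separator("..."))          # False: no whitespace at all
```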

Chaining the subpatterns:

\bcar\W*\s\W*(\S*\w\S*\W*\s\W*){1,2}wash\b

Notice the word boundaries \b on either side, which prevent "scar" and "washing" from being mistaken for "car" and "wash".
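
To illustrate the effect of the word boundaries, here is a small sketch. The helper name matches and the "scar" / "washing" sentences are my own examples, not texts from the question.

```python
import re

# The chained pattern from above, compiled case-insensitively.
PATTERN = re.compile(r"\bcar\W*\s\W*(\S*\w\S*\W*\s\W*){1,2}wash\b", re.IGNORECASE)

def matches(text):
    """Return True if the pattern is found anywhere in the text."""
    return PATTERN.search(text) is not None

print(matches("car will never wash"))      # True
print(matches("scar will never wash"))     # False: no boundary before "car"
print(matches("car will never washing"))   # False: no boundary after "wash"
```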

This matches all of these texts:

car. was good ?wash            # 2 words and punctuation between car and wash
car will never wash            # 2 words
Car, not ... (?!) wash         # 1 word
Car never:)... $@@! with wash  # 2 words
Car, was never wash            # 2 words
Car...:) things, not wash      # 2 words
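
The list above can be checked with a short script. This is a minimal sketch using re.search with the IGNORECASE flag, as in the question; the two counterexamples at the end (zero words and three words between "car" and "wash") are my own additions for illustration.

```python
import re

PATTERN = re.compile(r"\bcar\W*\s\W*(\S*\w\S*\W*\s\W*){1,2}wash\b", re.IGNORECASE)

texts = [
    "car. was good ?wash",
    "car will never wash",
    "Car, not ... (?!) wash",
    "Car never:)... $@@! with wash",
    "Car, was never wash",
    "Car...:) things, not wash",
]

# Every text from the list should produce a match.
for text in texts:
    found = PATTERN.search(text) is not None
    print(f"{text!r}: {'match' if found else 'no match'}")

# Illustrative counterexamples (not from the question):
print(PATTERN.search("car wash"))                   # None: zero words in between
print(PATTERN.search("car will never ever wash"))   # None: three words in between
```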

An alternate approach would be to first strip all punctuation from the string, and then match against \bcar\s+(\S+\s+){1,2}wash\b.
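
A minimal sketch of that alternate approach follows. It assumes "punctuation" means any character that is neither a word character nor whitespace (so underscores are kept), and the function name matches_after_stripping is hypothetical.

```python
import re

SIMPLE_PATTERN = re.compile(r"\bcar\s+(\S+\s+){1,2}wash\b", re.IGNORECASE)

def matches_after_stripping(text):
    """Strip punctuation first, then apply the simpler pattern."""
    # Assumption: punctuation = anything that is not \w and not whitespace.
    # Characters are deleted, not replaced, so "car.wash" would become "carwash".
    cleaned = re.sub(r"[^\w\s]", "", text)
    return SIMPLE_PATTERN.search(cleaned) is not None

for text in [
    "Car, not ... (?!) wash",
    "Car never:)... $@@! with wash",
    "Car, was never wash",
    "Car...:) things, not wash",
]:
    print(text, "->", matches_after_stripping(text))
```

Stripping first keeps the regex itself short, at the cost of an extra pass over the string.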

CodePudding user response:

So you want to match anything that starts with "car" and ends with "wash", with any number of characters in between? If that is the case, then regex = "car.*?wash" should be enough. Note that this pattern puts no limit on the number of words between "car" and "wash".

import re

regex = "car.*?wash"

# Use the re.search() function to find a match
match = re.search(regex, "Car, was never wash", flags=re.I)

# Check if a match was found
if match:
    print("Match found: ", match.group(0))
else:
    print("No match found")