Regular expression to match words in any order but words can be optional

Time:02-22

I've been looking around for quite some time and couldn't find a regular expression that fulfills my requirement.

I have multiple lines of text, something like this:

male positive average
average negative female
good negative female
female bad
male average
male
female
...
...

In the above example there are three groups of words: (male|female), (good|average|bad), and (positive|negative).

I want to capture each group of words in a named group: gender, quality, and feedback respectively.

The closest I reached is:

(?=.*(?P<gender>\b(fe)?male\b))(?=.*(?P<quality>(green|amber|red)))(?=.*(?P<feedback>(positive|negative))).*

This matches the groups gender, quality, and feedback in any order.

But it fails to match, and so does not populate the named groups, for lines like the following, where one or more groups are absent:

female green
positive male
positive female
female bad
male average
male
female

Note: The gender (male|female) is present in every line. Also, for simplicity only three groups are shown here; depending on requirements, there could be more.

Any help will be greatly appreciated.

CodePudding user response:

You need to anchor your regex to the beginning of a line (^) and make each of the positive lookaheads that contain named capture groups optional.

Also, you have some numbered capture groups that could be non-capturing groups, which would be less confusing since you are only interested in the named capture groups. Lastly, you are missing some word boundaries.
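
To illustrate the word-boundary point, here is a minimal Python sketch (assuming Python's `re` module, which accepts the `(?P<name>...)` syntax used in the question; the sample line is one of the question's own). Without `\b`, the greedy `.*` only has to backtrack as far as the `male` inside `female`, so the wrong word is captured:

```python
import re

line = "average negative female"

# No word boundaries: the greedy .* leaves just enough for "male",
# which it finds inside "female".
without_boundary = re.match(r".*(?P<gender>(?:fe)?male)", line)

# With word boundaries: "male" inside "female" is rejected because
# there is no boundary between "fe" and "male".
with_boundary = re.match(r".*(?P<gender>\b(?:fe)?male\b)", line)

print(without_boundary.group("gender"))  # male
print(with_boundary.group("gender"))     # female
```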

I suggest you change your expression to the following.

^(?=.*(?P<gender>\b(?:fe)?male\b))?(?=.*(?P<quality>\b(?:green|amber|red)\b))?(?=.*(?P<feedback>\b(?:positive|negative)\b))?.*


The regular expression can be broken down as follows.

^                         # match beginning of line
(?=                       # begin positive lookahead
  .*                      # match zero or more characters
  (?P<gender>             # begin named capture group 'gender'
    \b                    # match a word boundary
    (?:fe)?male           # match 'male', optionally preceded by 'fe'
    \b                    # match a word boundary
  )                       # end named capture group 'gender'
)?                        # end positive lookahead and make it optional
(?=                       # begin positive lookahead
  .*                      # match zero or more characters
  (?P<quality>            # begin named capture group 'quality'
    \b                    # match a word boundary
    (?:green|amber|red)   # match one of the three words
    \b                    # match a word boundary
  )                       # end named capture group 'quality'
)?                        # end positive lookahead and make it optional
(?=                       # begin positive lookahead
  .*                      # match zero or more characters
  (?P<feedback>           # begin named capture group 'feedback'    
    \b                    # match a word boundary
    (?:positive|negative) # match one of the two words
    \b                    # match a word boundary
  )                       # end named capture group 'feedback'
)?                        # end positive lookahead and make it optional
.*                        # match zero or more characters (the line)
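
As a rough sketch of how this could be used from Python, and how it might scale when more word groups are added, the pattern can be generated from a dictionary. Several things here are assumptions rather than part of the answer: the names `GROUPS`, `build_pattern`, and `parse` are hypothetical, the word lists are taken from the question's description (good|average|bad rather than green|amber|red), and the `?` is moved inside each lookahead instead of being applied to the lookahead itself. That last change is a deliberate variation: the lookahead still always succeeds and the named group is simply left unset when the word is absent, but it avoids relying on quantified lookaheads, which not every regex engine accepts.

```python
import re

# Hypothetical word groups, taken from the question's description.
GROUPS = {
    "gender": ["female", "male"],
    "quality": ["good", "average", "bad"],
    "feedback": ["positive", "negative"],
}


def build_pattern(groups):
    """Build one always-succeeding lookahead per named group."""
    parts = []
    for name, words in groups.items():
        alternatives = "|".join(re.escape(w) for w in words)
        # The optional part sits inside the lookahead: if no word from
        # this group is on the line, the lookahead matches the empty
        # string and the named group stays unset (None).
        parts.append(rf"(?=(?:.*\b(?P<{name}>{alternatives})\b)?)")
    return re.compile("^" + "".join(parts) + ".*")


PATTERN = build_pattern(GROUPS)


def parse(line):
    """Return a dict mapping each group name to its word, or None."""
    return PATTERN.match(line).groupdict()


for line in ["male positive average", "female bad", "male"]:
    print(line, "->", parse(line))
```

Adding another group then only means adding another entry to the dictionary; the order of words on the line does not matter because every lookahead starts scanning from the beginning of the line.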