Match a string ONLY if it appears in 3 different places (with a Regex)

Could I get help writing a regex that highlights words from the following list only if one (or more) of them appears three or more times? I've been trying to use {3,} without success.

BP, Exxon, Chevron, Equinor, TotalEnergies, Shell

E.g.

"Bp is a company. Bp sells oil. Bp drills oil." --> This would yield a match

"Bp is a company. Bp sells oil. Exxon is a competitor." --> This would not yield a match.

CodePudding user response：

You want something like:

\b(BP|Exxon|Chevron|Equinor|TotalEnergies|Shell)\b(?:.*?\b\1\b){2}

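You can drop the \b anchors if you also want to match the names inside longer words (subwords). Note that the list has "BP" while your examples use "Bp", so you will need a case-insensitive flag for the examples to match.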
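In case it helps, here is a minimal Python sketch of that pattern. The helper name repeated_name is made up for illustration, and the re.IGNORECASE and re.DOTALL flags are my assumptions (the first so that "Bp" matches "BP", the second so that repeats may sit on different lines); drop them if they don't fit your data.

```python
import re

# Company names from the question; re.escape guards against regex metacharacters.
NAMES = ["BP", "Exxon", "Chevron", "Equinor", "TotalEnergies", "Shell"]

# Assumed flags: IGNORECASE so "Bp" matches "BP", DOTALL so ".*?" can cross newlines.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, NAMES)) + r")\b(?:.*?\b\1\b){2}",
    re.IGNORECASE | re.DOTALL,
)

def repeated_name(text):
    """Return a listed name that occurs at least three times, or None."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

print(repeated_name("Bp is a company. Bp sells oil. Bp drills oil."))
print(repeated_name("Bp is a company. Bp sells oil. Exxon is a competitor."))
```

The match itself spans from the first occurrence to the third, so if you need every occurrence highlighted separately you would still have to search for group 1 again afterwards.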

CodePudding user response：

You could do this in two steps.

Step 1: identify a word that appears at least thrice

Based on your examples, I've assumed that the repeating word must consist of a capital letter followed by zero or more lowercase letters. You may of course adjust the character classes to meet your requirements. The regular expression to match is

^(?=.*\b([A-Z][a-z]*)\b(?:.*\b\1\b){2})

If a word meeting the requirements appears at least thrice, there will be a zero-width match at the start of the string and that word will be saved to capture group 1.

For the following three strings I've marked the word captured to group 1. Where no word is captured (the third example) there is no word meeting requirements that appears at least thrice.

Yes, Bp is a company. Bp sells oil. Bp drills oil.
     ^^
Ab and Cd are big, Ab and Cd are not, Cd and Ef and Cde are quick.
       ^^
Yes, Bp is a company. Bp sells oil. Exxon is a competitor.


Note that had I not required the matching word to begin with a capital letter, 'and' would have been matched in the second example.

The regular expression can be broken down as follows.

^                # match the beginning of the string
(?=              # begin a positive lookahead
  .*             # match >= 0 characters other than line terminators
  \b             # match a word boundary
  (              # begin capture group 1
    [A-Z][a-z]*  # match an uppercase letter followed by >= 0 lowercase letters
  )              # end capture group 1
  \b             # match a word boundary
  (?:            # begin a non-capture group
    .*           # match >= 0 characters other than line terminators
    \b\1\b       # match the content of capture group 1 surrounded by word boundaries
  ){2}           # end the non-capture group and execute it twice
)                # end the positive lookahead
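As a quick check of Step 1, the lookahead can be run against the three test strings in Python. This is only a sketch; the function name find_repeated_word is hypothetical.

```python
import re

# Step 1 pattern: a zero-width lookahead anchored at the start of the string.
# When it succeeds, the repeated word is left in capture group 1.
STEP1 = re.compile(r"^(?=.*\b([A-Z][a-z]*)\b(?:.*\b\1\b){2})")

def find_repeated_word(text):
    """Return a capitalized word that appears at least three times, or None."""
    m = STEP1.search(text)
    return m.group(1) if m else None

for s in (
    "Yes, Bp is a company. Bp sells oil. Bp drills oil.",
    "Ab and Cd are big, Ab and Cd are not, Cd and Ef and Cde are quick.",
    "Yes, Bp is a company. Bp sells oil. Exxon is a competitor.",
):
    print(find_repeated_word(s))
```

Because the match is zero-width, m.group(0) is always the empty string; only group 1 carries useful information.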

Step 2: Match all instances of the repeating word in the string

For example, having identified 'Cd' as the repeating word in the second test string, we would then match that test string with the regular expression

\bCd\b
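Putting the two steps together might look like the following Python sketch. The function name highlight_spans is hypothetical, and returning (start, end) offsets is just one assumed way of representing the "highlights"; re.escape is not strictly needed for purely alphabetic words but is harmless.

```python
import re

# Step 1: find a capitalized word that appears at least three times.
STEP1 = re.compile(r"^(?=.*\b([A-Z][a-z]*)\b(?:.*\b\1\b){2})")

def highlight_spans(text):
    """Return (start, end) offsets of every occurrence of the repeated word."""
    m = STEP1.search(text)
    if m is None:
        return []  # no word appears three or more times
    # Step 2: build \bWORD\b from the captured word and collect all matches.
    word = re.compile(r"\b" + re.escape(m.group(1)) + r"\b")
    return [hit.span() for hit in word.finditer(text)]

text = "Ab and Cd are big, Ab and Cd are not, Cd and Ef and Cde are quick."
for start, end in highlight_spans(text):
    print(start, end, text[start:end])
```

Note that 'Cde' is not included, since the word boundaries in Step 2 exclude subwords.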
