PHP Regex: Remove words not equal exactly 3 characters

Time:09-16

There is an excellent, very close answer at "Remove words less than 3 chars" (with a demo), where the regex

\b([a-z]{1,2})\b

removes all words shorter than 3 characters.

But how can this demo be turned the other way around, to remove all words that are NOT EXACTLY 3 characters long? We can match words of exactly 3 characters with

\b([a-z]{3})\b

but how do I tell the regex to remove all the other words, the ones that are NOT 3 characters long?

So in the regex demo referenced above, only the word 'and' should be left.

CodePudding user response:

Use alternation to match words of either 1-2 letters or 4 or more letters.

\b(?:[a-z]{1,2}|[a-z]{4,})\b
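
As a rough sketch of how this alternation could be applied in PHP with preg_replace: the sample sentence and the whitespace cleanup step below are assumptions for illustration, not taken from the linked demo.

```php
<?php
// Sample input is an assumption for illustration; it is not the demo text.
$text = 'a cat and the elephant go to town';

// Remove words of 1-2 letters or 4+ letters, leaving only three-letter words.
$result = preg_replace('/\b(?:[a-z]{1,2}|[a-z]{4,})\b/', '', $text);

// Collapse the runs of spaces left behind and trim the ends.
$result = trim(preg_replace('/\s+/', ' ', $result));

echo $result, PHP_EOL; // cat and the
```

Note that [a-z] only covers lowercase ASCII letters; for mixed-case input the i flag could be added.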

CodePudding user response:

Another variation with a negative lookbehind asserting not 3 chars to the left

\b[a-z]+\b(?<!\b[a-z][a-z][a-z]\b)

Regex demo

Or with a skip-fail approach for words of 3 chars a-z:

\b[a-z]{3}\b(*SKIP)(*F)|\b[a-z]+\b

Regex demo
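
To compare the lookbehind and skip-fail variants side by side, something like the following could be used. The sample sentence and the array keys are assumptions chosen for illustration; only the two patterns come from the answer.

```php
<?php
// Sample input is an assumption for illustration, not the text of the linked demos.
$text = 'a cat and the elephant go to town';

$patterns = [
    // Match any word, then reject it if the 3 chars to the left form a whole word.
    'lookbehind' => '/\b[a-z]+\b(?<!\b[a-z][a-z][a-z]\b)/',
    // Consume three-letter words, skip past them and fail; match everything else.
    'skip-fail'  => '/\b[a-z]{3}\b(*SKIP)(*F)|\b[a-z]+\b/',
];

$results = [];
foreach ($patterns as $name => $pattern) {
    // Delete every match, then tidy up the spaces left behind.
    $stripped = preg_replace($pattern, '', $text);
    $results[$name] = trim(preg_replace('/\s+/', ' ', $stripped));
    echo $name, ': ', $results[$name], PHP_EOL;
}
// Both lines print "cat and the".
```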

CodePudding user response:

I think maybe:

\b(?![a-z]{3}\b)[a-z]+\b

Matching:

  • \b - A word-boundary.
  • (?![a-z]{3}\b) - A negative lookahead to avoid three-letter words.
  • [a-z]+\b - Any word of 1+ letters (greedy) up to a word boundary.

Another trick is to use a capture group to match what you want:

\b(?:[a-z]{3}|([a-z]+))\b

  • \b - A word-boundary
  • (?:[a-z]{3}|([a-z]+)) - A capture group nested inside an alternation: the first branch consumes three-letter words without capturing them, and the second branch captures any word of 1+ letters (greedy).
  • \b - A word-boundary
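
A possible way to use both patterns from this answer in PHP is sketched below. With the capture-group version, three-letter words are matched as well, so a callback is one way to drop only what group 1 captured. The sample sentence and the $tidy helper are assumptions for illustration.

```php
<?php
// Sample input is an assumption for illustration.
$text = 'a cat and the elephant go to town';

// Lookahead version: every match is a word to delete.
$viaLookahead = preg_replace('/\b(?![a-z]{3}\b)[a-z]+\b/', '', $text);

// Capture-group version: three-letter words match too but leave group 1 empty,
// so the callback puts them back and drops only the captured words.
$viaGroup = preg_replace_callback(
    '/\b(?:[a-z]{3}|([a-z]+))\b/',
    function (array $m): string {
        return ($m[1] ?? '') !== '' ? '' : $m[0];
    },
    $text
);

// Hypothetical helper: collapse the spaces left behind by the removals.
$tidy = function (string $s): string {
    return trim(preg_replace('/\s+/', ' ', $s));
};

echo $tidy($viaLookahead), PHP_EOL; // cat and the
echo $tidy($viaGroup), PHP_EOL;     // cat and the
```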

CodePudding user response:

With an optional group of letters with at least 2 characters and a possessive quantifier:

\b[a-z]{1,2}+(?:[a-z]{2,})?\b

demo

This approach is based on a calculation trick and on backtracking.
In other words: 2 + x = 3 with x > 1 has no solution.

If I had written \b[a-z]{1,2}(?:[a-z]{2,})?\b (with or without the last \b, it isn't important), then when the regex engine reaches the start of a three-letter word, [a-z]{1,2} would first consume the first two letters. But since an extra character is needed before the last word boundary can succeed, the regex engine has no choice other than to backtrack into the {1,2} quantifier. After one backtracking step, [a-z]{1,2} would have consumed only one character and (?:[a-z]{2,})?\b could succeed. By making this quantifier possessive, I forbid this backtracking step. Since, for a three-letter word, [a-z]{1,2}+ takes 2 characters and [a-z]{2,} needs at least 2 letters, the pattern fails.
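
The difference described above can be observed by listing what each version matches. This is only a sketch; the sample sentence and variable names are assumptions for illustration.

```php
<?php
// Sample input is an assumption for illustration.
$text = 'a cat and the elephant go to town';

// Possessive {1,2}+ : the engine may not give back a letter once it is taken.
preg_match_all('/\b[a-z]{1,2}+(?:[a-z]{2,})?\b/', $text, $possessive);

// Plain {1,2} : backtracking to one letter lets three-letter words match (1 + 2 = 3).
preg_match_all('/\b[a-z]{1,2}(?:[a-z]{2,})?\b/', $text, $backtracking);

echo implode(' ', $possessive[0]), PHP_EOL;   // a elephant go to town
echo implode(' ', $backtracking[0]), PHP_EOL; // a cat and the elephant go to town
```

Only the possessive version leaves 'cat', 'and' and 'the' unmatched, which is what makes it usable for removing every other word.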


Use the word boundary and force a failure with a possessive quantifier:

\b(?:[a-z]{3}\b)?+[a-z]+

demo

This one also plays with an impossible assertion: three letters followed by a word boundary can't be followed by a letter.

Once again, with a three-letter word, once the three letters are consumed by [a-z]{3}, the possessive quantifier ?+ forbids backtracking and [a-z]+ makes the pattern fail.


Force a failure on 3 letters and skip them using a backtracking control verb:

\b[a-z]{3}\b(*SKIP)^|[a-z]+

demo
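
As a quick check of the last two patterns in this answer, both could be run through preg_replace as below. The sample sentence and the array keys are assumptions for illustration.

```php
<?php
// Sample input is an assumption for illustration, not the text of the linked demos.
$text = 'a cat and the elephant go to town';

$patterns = [
    // The possessive ?+ keeps a consumed three-letter word, so [a-z]+ then fails.
    'possessive optional group' => '/\b(?:[a-z]{3}\b)?+[a-z]+/',
    // ^ can never succeed after three consumed letters, so (*SKIP) jumps past the word.
    'skip with ^'               => '/\b[a-z]{3}\b(*SKIP)^|[a-z]+/',
];

$results = [];
foreach ($patterns as $name => $pattern) {
    // Delete every match, then tidy up the spaces left behind.
    $stripped = preg_replace($pattern, '', $text);
    $results[$name] = trim(preg_replace('/\s+/', ' ', $stripped));
    echo $name, ': ', $results[$name], PHP_EOL;
}
// Both lines print "cat and the".
```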
