Regex to "normalize" usage of SPACE after . , : chars (and some exceptions)

Time:11-14

I need to normalize some texts (product descriptions) with regard to the correct usage of the ., , and : symbols (no space before and exactly one space after).

The regex I've come up with is this:

$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $variation['DESCRIPTION']);

The problem is that this matches four cases it shouldn't touch:

  • Any decimal number, like 5.5
  • Any thousand separator, like 4,500
  • A "fixed" phrase in Greek, ό,τι
  • The ellipsis symbol, ...
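To make the problem concrete, here is a minimal sketch that runs the pattern from the question against the first three exceptions. The helper name normalize_naive and the sample strings are made up for illustration only.

```php
<?php
// Hypothetical helper: wraps the pattern from the question, unchanged.
function normalize_naive(string $text): string
{
    return preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $text);
}

echo normalize_naive('Side length 50.5 cm'), "\n"; // Side length 50. 5 cm
echo normalize_naive('Value 4,500'), "\n";         // Value 4, 500
echo normalize_naive('ό,τι'), "\n";                // ό, τι
```

In each case the separator is treated as ordinary punctuation and a space is injected after it.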

Especially for the numeric exception, I know it can be achieved with some negative lookahead/lookbehind but unfortunately I can't combine them in my current pattern.

This is a fiddle for you to check (the cases that shouldn't be matched are in lines 2, 3, 4).

EDIT: I just realized that I'd need the pattern to also exclude ellipsis ... from the matches!

Any help will be very much appreciated! TIA

CodePudding user response:

You can add two lookaheads containing lookbehinds:

\s*([:,.])(?!(?<=ό,)τι)(?!(?<=\d.)\d)(?!\s*<br\s*/>)\s*

See the regex demo. Note that I also added \s* to the last lookahead and swapped it with the consuming \s* to fail the match if there is <br/> after any zero or more whitespaces after the :, , or ..
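The effect of moving the lookahead can be seen in isolation. The sketch below is only an illustration (the sample string and variable names are assumptions): it compares the lookahead placed after the consuming \s*, as in the question, with the lookahead placed before it.

```php
<?php
$text = 'Made of rubber. <br />- Softfoam floor';

// Lookahead after the consuming \s* (as in the question): \s* can backtrack
// to zero width, so the lookahead then sees " <br />" and no longer blocks.
$before = preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $text);

// Lookahead before the consuming \s*, carrying its own \s* inside.
$after = preg_replace('#\s*([:,.])(?!\s*<br\s*/>)\s*#', '$1 ', $text);

echo $before, "\n"; // Made of rubber.  <br />- Softfoam floor  (two spaces)
echo $after, "\n";  // Made of rubber. <br />- Softfoam floor   (unchanged)
```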

Details

  • \s* - zero or more whitespaces
  • ([:,.]) - Group 1: a :, , or .
  • (?!(?<=ό,)τι) - fail the match if the next two chars are τι preceded with ό,
  • (?!(?<=\d.)\d) - fail the match if the next char is a digit preceded with a digit and any char (note that a . is enough since the [:,.] already match the char allowed/required, here, we just need to "jump" over that matched char)
  • (?!\s*<br\s*/>) - a negative lookahead that fails the match if there are zero or more whitespaces, <br, zero or more whitespaces, /> immediately to the right of the current location.
  • \s* - zero or more whitespaces.
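As a rough usage sketch, the pattern above can be dropped into preg_replace as follows. The function name normalize_lookaround, the u flag and the fallback to the input on error are assumptions added for illustration, not part of the answer.

```php
<?php
// Hypothetical wrapper around the lookaround pattern described above.
// The u flag (an assumption) makes PCRE treat pattern and subject as UTF-8.
function normalize_lookaround(string $text): string
{
    $result = preg_replace(
        '#\s*([:,.])(?!(?<=ό,)τι)(?!(?<=\d.)\d)(?!\s*<br\s*/>)\s*#u',
        '$1 ',
        $text
    );

    // preg_replace returns null on error (e.g. invalid UTF-8); keep the input then.
    return $result ?? $text;
}

echo normalize_lookaround('Side length 50.5 cm ,made:well'), "\n"; // Side length 50.5 cm, made: well
echo normalize_lookaround('Value 4,500 and ό,τι'), "\n";           // Value 4,500 and ό,τι
```

Note that, as written, this pattern has no rule for the ellipsis, so a run of three dots would still be split into separate dots.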

CodePudding user response:

If Wiktor's lookaround-heavy pattern is too difficult for you to conceptualize, maintain, or adapt, then perhaps a match-and-ignore technique will be easier for you. Admittedly, Wiktor's pattern is optimized for performance.

Pattern:

~                        #starting pattern delimiter 
\s*                      #zero or more whitespaces
(?:                      #start non-capturing group #1
  (?:                    #start non-capturing group #2
    \.\d+                #float expression not requiring leading digits
    |                    #or
    \d{1,3}(?:,\d{3})+   #number containing thousands separators
    |                    #or
    ό,τι                 #literal greek phrase
    |                    #or
    <br\s*/>             #html break tag
    |                    #or
    \.{3}                #three literal dots (ellipsis)
  )                      #end non-capturing group #2
  (*SKIP)(*FAIL)         #discard anything matched by group #2
  |                      #or
  ([:,.])                #capture group #1
)                        #end non-capturing group #1
\s*                      #zero or more whitespaces
~                        #ending pattern delimiter

If you want to extend your pattern with more disqualifying rules, just add another pipe followed by a subpattern that matches the unwanted substring.
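For instance, suppose clock times such as 10:30 also had to be left alone. That requirement is not in the question; it is a hypothetical rule used here only to show where a new alternative would go. A sketch under that assumption, using the x flag so the pattern can stay annotated (the function name normalize_skip is also made up):

```php
<?php
// Hypothetical wrapper; the pattern follows the match-and-ignore layout above,
// with one extra alternative added to the skip list.
function normalize_skip(string $text): string
{
    $pattern = '~
        \s*
        (?:
          (?:
            \.\d+                  # decimal part of a number
            | \d{1,3}(?:,\d{3})+   # thousands separators
            | \d{1,2}:\d{2}        # hypothetical extra rule: times like 10:30
            | ό,τι                 # fixed Greek phrase
            | <br\s*/>             # html break tag
            | \.{3}                # ellipsis
          )
          (*SKIP)(*FAIL)           # discard whatever the skip list matched
          |
          ([:,.])                  # punctuation to normalize
        )
        \s*
    ~xu';

    // Fall back to the input if preg_replace reports an error (returns null).
    return preg_replace($pattern, '$1 ', $text) ?? $text;
}

echo normalize_skip('Opens at 10:30 ,closes at 18:00 sharp'), "\n";
// Opens at 10:30, closes at 18:00 sharp
```

The order of the alternatives inside the skip list only matters when two of them can match at the same position; here they cannot, so the new rule can go anywhere in the list.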

Code: (Demo)

$text = <<<TEXT
Composition:80% Polyamide,   15% Elastane, 5% Wool.
Side length 50.5 cm <---- THIS SHOULDN'T BE MATCHED
Value 4,500 <---- THIS SHOULDN'T BE MATCHED EITHER

What about $1,234,567.89?

ό,τι<---- THIS IS A FIXED PHRASE IN GREEK AND THEREFORE SHOULDN'T BE MATCHED
Comfort and timeless design characterize the Puma Smash V2 made of suede leather. They can be worn all the time ,being a unique choice for those who want to stand out .Made of rubber.<br />- Softfoam floor<br />- Binding with laces

Specs:<br />&bull; Something<br /><br />&bull; Something else<br />&bull; One more

Children's Form Champion<br /><br />Children's set that will give a comfortable feeling for endless hours of play.<br />It consists of a cardigan and trousers ,made of soft fabric and have rib cuffs and legs for a better fit.<br /><br />&bull; Normal fit<br /><br />&bull; Cardigan  :Rib cuffs, zippers throughout length, high neck, Champion logo <br /> <br />&bull; Pants: Elastic waist with drawstring, ribbed legs, Champion logo. Don't worry,there'll be ...more!
TEXT;

echo preg_replace(
         '~\s*(?:(?:\.\d+|\d{1,3}(?:,\d{3})+|ό,τι|<br\s*/>|\.{3})(*SKIP)(*FAIL)|([:,.]))\s*~',
         '$1 ',
         $text
     );

Output:

Composition: 80% Polyamide, 15% Elastane, 5% Wool. Side length 50.5 cm <---- THIS SHOULDN'T BE MATCHED
Value 4,500 <---- THIS SHOULDN'T BE MATCHED EITHER

What about $1,234,567.89?

ό,τι<---- THIS IS A FIXED PHRASE IN GREEK AND THEREFORE SHOULDN'T BE MATCHED
Comfort and timeless design characterize the Puma Smash V2 made of suede leather. They can be worn all the time, being a unique choice for those who want to stand out. Made of rubber. <br />- Softfoam floor<br />- Binding with laces

Specs: <br />&bull; Something<br /><br />&bull; Something else<br />&bull; One more

Children's Form Champion<br /><br />Children's set that will give a comfortable feeling for endless hours of play. <br />It consists of a cardigan and trousers, made of soft fabric and have rib cuffs and legs for a better fit. <br /><br />&bull; Normal fit<br /><br />&bull; Cardigan: Rib cuffs, zippers throughout length, high neck, Champion logo <br /> <br />&bull; Pants: Elastic waist with drawstring, ribbed legs, Champion logo. Don't worry, there'll be ...more!