I need to normalize some texts (product descriptions) with regard to the correct usage of the . , and : symbols (no space before and one space after).
The regex I've come up with is this:
$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $variation['DESCRIPTION']);
The problem is that this also matches several cases it shouldn't touch:
- Any decimal number, like 5.5
- Any thousand separator, like 4,500
- A "fixed" phrase in Greek, ό,τι
- The ellipsis symbol, ...
Especially for the numeric exception, I know it can be achieved with some negative lookahead/lookbehind but unfortunately I can't combine them in my current pattern.
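To make the over-matching concrete, here is a minimal sketch that runs the pattern from the question against the problem cases. The function name naive_normalize is hypothetical and only wraps the preg_replace call shown above.

```php
<?php
// Hypothetical wrapper around the question's original pattern.
function naive_normalize(string $text): string
{
    return preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $text);
}

// Each of these should be left alone, but the pattern inserts a space:
echo naive_normalize('Side length 50.5 cm'), "\n"; // Side length 50. 5 cm
echo naive_normalize('Value 4,500'), "\n";         // Value 4, 500
echo naive_normalize('ό,τι'), "\n";                // ό, τι
```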
This is a fiddle for you to check (the cases that shouldn't be matched are in lines 2, 3, 4).
EDIT: I just realized that I'd need the pattern to also exclude the ellipsis (...) from the matches!
Any help will be very much appreciated! TIA
CodePudding user response:
You can add two lookaheads containing lookbehinds:
\s*([:,.])(?!(?<=ό,)τι)(?!(?<=\d.)\d)(?!\s*<br\s*/>)\s*
See the regex demo. Note that I also added \s* to the last lookahead and swapped it with the consuming \s* to fail the match if there is <br/> after any zero or more whitespaces after the : , or . character.
Details
- \s* - zero or more whitespaces
- ([:,.]) - Group 1: a : , or .
- (?!(?<=ό,)τι) - fail the match if the next two chars are τι preceded with ό,
- (?!(?<=\d.)\d) - fail the match if the next char is a digit preceded with a digit and any char (note that a . is enough since the [:,.] already matches the char allowed/required; here, we just need to "jump" over that matched char)
- (?!\s*<br\s*/>) - a negative lookahead that fails the match if there are zero or more whitespaces, <br, zero or more whitespaces, /> immediately to the right of the current location
- \s* - zero or more whitespaces.
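As a rough sketch of how this pattern could be plugged into preg_replace: the function name normalize_punctuation, the ~ delimiters, and the u (UTF-8) flag are assumptions added here, not part of the answer. Note that the pattern as written does not cover the ellipsis case from the question's edit.

```php
<?php
// Hypothetical helper; only the regex body comes from the answer above.
function normalize_punctuation(string $text): string
{
    // The u flag (an assumption) makes the pattern and subject UTF-8 aware.
    $pattern = '~\s*([:,.])(?!(?<=ό,)τι)(?!(?<=\d.)\d)(?!\s*<br\s*/>)\s*~u';

    // preg_replace returns null on error (e.g. invalid UTF-8); fall back to the input.
    return preg_replace($pattern, '$1 ', $text) ?? $text;
}

echo normalize_punctuation('all the time ,being unique .Made of rubber'), "\n";
// all the time, being unique. Made of rubber

echo normalize_punctuation('Side length 50.5 cm, value 4,500, ό,τι'), "\n";
// Side length 50.5 cm, value 4,500, ό,τι
```

Because the <br> check sits in a lookahead, punctuation directly followed by a break tag (for example Specs:<br />) is left as it is.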
CodePudding user response:
If Wiktor's lookaround-heavy pattern is too difficult for you to conceptualize, maintain, or adapt, then perhaps a match-and-ignore technique will be easier for you. Admittedly, Wiktor's pattern is optimized for performance.
Pattern:
~ #starting pattern delimiter
\s* #zero or more whitespaces
(?: #start non-capturing group #1
(?: #start non-capturing group #2
\.\d+ #float expression not requiring leading digits
| #or
\d{1,3}(?:,\d{3})+ #number containing thousands separators
| #or
ό,τι #literal greek phrase
| #or
<br\s*/> #html break tag
| #or
\.{3} #three literal dots (ellipsis)
) #end non-capturing group #2
(*SKIP)(*FAIL) #discard anything matched by group #2
| #or
([:,.]) #capture group #1
) #end non-capturing group #1
\s* #zero or more whitespaces
~ #ending pattern delimiter
As you wish to extend your pattern to include more disqualifying rules, just add another pipe and add a subpattern to match the unwanted substring.
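One way to keep the list of exclusions easy to extend is to hold the subpatterns in an array and join them with pipes. This is only a sketch: the function name normalize_with_skips, the u flag, and the extra time-of-day rule (\d{1,2}:\d{2}) are hypothetical additions for illustration, and the float and thousands-separator subpatterns are written with + quantifiers here.

```php
<?php
// Hypothetical helper that builds the match-and-ignore pattern from a list.
function normalize_with_skips(string $text): string
{
    $skip = [
        '\.\d+',              // decimals such as .5 or the .89 in 567.89
        '\d{1,3}(?:,\d{3})+', // numbers with thousands separators
        'ό,τι',               // fixed Greek phrase
        '<br\s*/>',           // HTML break tag
        '\.{3}',              // ellipsis
        '\d{1,2}:\d{2}',      // hypothetical new rule: times such as 10:30
    ];

    // Anything matched by a skip rule is discarded via (*SKIP)(*FAIL);
    // otherwise a single : , or . is captured and re-spaced.
    $pattern = '~\s*(?:(?:' . implode('|', $skip) . ')(*SKIP)(*FAIL)|([:,.]))\s*~u';

    return preg_replace($pattern, '$1 ', $text) ?? $text;
}

echo normalize_with_skips('Open at 10:30 ,closed at 18:00 .Call us'), "\n";
// Open at 10:30, closed at 18:00. Call us
```

Adding a further exclusion is then a one-line change to the $skip array, provided the new subpattern does not contain the ~ delimiter.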
Code: (Demo)
$text = <<<TEXT
Composition:80% Polyamide, 15% Elastane, 5% Wool.
Side length 50.5 cm <---- THIS SHOULDN'T BE MATCHED
Value 4,500 <---- THIS SHOULDN'T BE MATCHED EITHER
What about $1,234,567.89?
ό,τι<---- THIS IS A FIXED PHRASE IN GREEK AND THEREFORE SHOULDN'T BE MATCHED
Comfort and timeless design characterize the Puma Smash V2 made of suede leather. They can be worn all the time ,being a unique choice for those who want to stand out .Made of rubber.<br />- Softfoam floor<br />- Binding with laces
Specs:<br />• Something<br /><br />• Something else<br />• One more
Children's Form Champion<br /><br />Children's set that will give a comfortable feeling for endless hours of play.<br />It consists of a cardigan and trousers ,made of soft fabric and have rib cuffs and legs for a better fit.<br /><br />• Normal fit<br /><br />• Cardigan :Rib cuffs, zippers throughout length, high neck, Champion logo <br /> <br />• Pants: Elastic waist with drawstring, ribbed legs, Champion logo. Don't worry,there'll be ...more!
TEXT;
echo preg_replace(
'~\s*(?:(?:\.\d+|\d{1,3}(?:,\d{3})+|ό,τι|<br\s*/>|\.{3})(*SKIP)(*FAIL)|([:,.]))\s*~',
'$1 ',
$text
);
Output:
Composition: 80% Polyamide, 15% Elastane, 5% Wool. Side length 50.5 cm <---- THIS SHOULDN'T BE MATCHED
Value 4,500 <---- THIS SHOULDN'T BE MATCHED EITHER
What about $1,234,567.89?
ό,τι<---- THIS IS A FIXED PHRASE IN GREEK AND THEREFORE SHOULDN'T BE MATCHED
Comfort and timeless design characterize the Puma Smash V2 made of suede leather. They can be worn all the time, being a unique choice for those who want to stand out. Made of rubber. <br />- Softfoam floor<br />- Binding with laces
Specs: <br />• Something<br /><br />• Something else<br />• One more
Children's Form Champion<br /><br />Children's set that will give a comfortable feeling for endless hours of play. <br />It consists of a cardigan and trousers, made of soft fabric and have rib cuffs and legs for a better fit. <br /><br />• Normal fit<br /><br />• Cardigan: Rib cuffs, zippers throughout length, high neck, Champion logo <br /> <br />• Pants: Elastic waist with drawstring, ribbed legs, Champion logo. Don't worry, there'll be ...more!