I want to split a string as per the parameters laid out in the title. I've tried a few different things including using preg_match with not much success so far and I feel like there may be a simpler solution that I haven't clocked on to.
I have a regex that matches the "price" mentioned in the title (see below).
/(?=.)\£(([1-9][0-9]{0,2}(,[0-9]{3})*)|[0-9]+)?(\.[0-9]{1,2})?/
And here are a few example scenarios and what my desired outcome would be:
Example 1:
input: "This string should not split as the only periods that appear are here £19.99 and also at the end."
output: n/a
Example 2:
input: "This string should split right here. As the period is not part of a price or at the end of the string."
output: "This string should split right here"
Example 3:
input: "There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price"
output: "There is a price in this string £19.99, but it should only split at this point"
CodePudding user response:
I suggest using
preg_split('~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u', $string)
See the regex demo.
The pattern matches your price pattern, \£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?, and skips it with (*SKIP)(*F); otherwise, it matches a non-final . with \.(?!\s*$) (even if trailing whitespace characters follow it).
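As a rough illustration of the split approach, here is a minimal sketch. The function name splitOnSentenceDots is hypothetical, the + quantifier after [0-9] is assumed, and the optional backslash before £ is left out:

```php
<?php
// Hypothetical helper name. Prices are matched first and discarded with
// (*SKIP)(*F), so only dots outside a price (and not at the end) split.
function splitOnSentenceDots(string $text): array
{
    $pattern = '~£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u';
    // Trim each piece so the space that follows a splitting dot is dropped.
    return array_map('trim', preg_split($pattern, $text));
}

$parts = splitOnSentenceDots('There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price');
echo $parts[0], PHP_EOL;
```

With the string from Example 1, the same call returns an array with a single element, so a count of 1 can be treated as "no split".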
If you really only need to split on the first occurrence of the qualifying dot you can use a matching approach:
preg_match('~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su', $string, $match)
See the regex demo. Here,
^ - matches the start of the string
((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+) - Group 1: one or more occurrences of your currency pattern or any one char other than a . char
\. - a . char
(.*) - Group 2: the rest of the string.
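The matching approach could be wrapped along these lines. This is only a sketch: the function name firstSentence is hypothetical, the + quantifiers are assumed, and the optional backslash before £ is left out:

```php
<?php
// Hypothetical helper: returns the text before the first dot that is not
// part of a price, or null when the string contains no such dot.
function firstSentence(string $text): ?string
{
    $pattern = '~^((?:£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su';
    // Group 1 holds everything up to (but excluding) the qualifying dot.
    return preg_match($pattern, $text, $match) === 1 ? $match[1] : null;
}

echo firstSentence('This string should split right here. As the period is not part of a price or at the end of the string.'), PHP_EOL;
```

Note that this pattern also matches when the only qualifying dot is the final one, as in Example 1. Group 2 is then empty, so a caller that wants "n/a" in that case would need to check for it.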
CodePudding user response:
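You could simply use a regex that matches a period followed by whitespace:
\.\s
Since the period at the end of the first sentence is followed by a space, while the period inside a price is followed by a digit, this should work just as well, right?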
CodePudding user response:
To split a text into sentences while avoiding pitfalls such as dots or thousands separators in numbers and some abbreviations (like etc.), the best tool is IntlBreakIterator, which is designed to deal with natural language:
$str = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';
$si = IntlBreakIterator::createSentenceInstance('en-US');
$si->setText($str);
$si->next();
echo substr($str, 0, $si->current());
IntlBreakIterator::createSentenceInstance returns an iterator that gives the boundary offsets of the different sentences in the string.
It also takes ?, ! and ... into account. Besides handling the number and price pitfalls, it also works well with this kind of string:
$str = 'John Smith, Jr. was running naked through the garden crying "catch me! catch me!", but no one was chasing him. His psychatre looked at him from the window with a circumspect eye.';
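If every sentence is needed rather than only the first one, the iterator's getPartsIterator() method can be used to walk over the pieces between boundaries. A minimal sketch, assuming the intl extension is installed (the variable names are only illustrative):

```php
<?php
// Requires the intl extension. getPartsIterator() yields the text between
// consecutive boundaries, so each item is one sentence, including any
// whitespace that follows it.
$str = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';

$si = IntlBreakIterator::createSentenceInstance('en-US');
$si->setText($str);

$sentences = [];
foreach ($si->getPartsIterator() as $sentence) {
    $sentences[] = $sentence;
}

// Print each sentence on its own line, without the trailing whitespace.
foreach ($sentences as $sentence) {
    echo rtrim($sentence), PHP_EOL;
}
```

The exact boundaries depend on the ICU rules shipped with the PHP build, so it is worth checking the result against your own sample strings.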
More about the rules used by IntlBreakIterator here.