I am trying to split text into "steps". Let's say my text is
my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";
I'd like the output to be:
"1.Do this."
"2.Then do that."
"3.And then maybe that."
"4.Complete!"
I'm not really that good with regex, so help would be great!
I've tried many combinations, like:
split /(\s\d.)/
But it splits the numbering away from the text.
CodePudding user response:
I would indeed use split. But you need to exclude the digit from the match by using a lookahead.
my @steps = split /\s+(?=\d+\.)/, $steps;
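As a self-contained sketch of the lookahead split, assuming the steps are separated by one or more whitespace characters and each step number is one or more digits followed by a period:

```perl
use strict;
use warnings;

my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";

# Split on whitespace only where a number and a period follow.
# The lookahead (?=...) checks for the number without consuming it,
# so the number stays attached to the step that comes after the split point.
my @steps = split /\s+(?=\d+\.)/, $steps;

print qq("$_"\n) for @steps;
```

Because the lookahead is zero-width, only the whitespace is removed as the separator.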
CodePudding user response:
All step descriptions start with a number followed by a period and then have non-numbers, until the next number. So capture all such patterns:
my @s = $steps =~ / [0-9]+ \. [^0-9]+ /xg;
say for @s;
This works only if there are surely no numbers in the steps' descriptions, as with any approach that relies on matching a number (even one followed by a period, because of decimal numbers).†
If there may be numbers in there, we'd need to know more about the structure of the text.
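A runnable sketch of the "number, period, then non-digits" match on the question's sample text. The trailing-whitespace trim is my own addition, since a run of non-digits also takes in the space before the next number:

```perl
use strict;
use warnings;
use feature 'say';

my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";

# A step is a run of digits, a period, then everything up to the next digit.
my @s = $steps =~ / [0-9]+ \. [^0-9]+ /xg;

# [^0-9]+ also captures the space before the next number, so trim it
# (an extra step, not part of the pattern itself).
s/\s+\z// for @s;

say for @s;
```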
Another delimiting pattern to consider is the punctuation that ends a sentence (. and ! in these examples), provided that no such characters appear inside a step's description and that no step has multiple sentences:
my @s = $steps =~ / [0-9]+ \. .*? [.!] /xg;
Augment the list of patterns that end an item's description as needed, say with a ? and/or a ." sequence, as punctuation often goes inside quotes.‡
If this doesn't fit the text either (it may have multiple sentences?) then we'd really need a more precise description of what that text is like.
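A small sketch of the punctuation-terminated match with ? added to the terminators. The sample text is hypothetical and was chosen to show that a bare number inside a description is fine here, as long as each step is a single sentence:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical sample: step 1 contains a number, step 2 ends with "?".
my $text = '1.Take 2 pills. 2.Feeling better? 3.Rest!';

# Number, period, then as little as possible up to the first
# sentence-ending character (period, exclamation or question mark).
my @s = $text =~ / [0-9]+ \. .*? [.!?] /xg;

say for @s;
```

Here the lazy .*? stops at the first terminator, so a decimal such as 2.50 inside a description would still cut the step short.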
† An approach using a "numbers-period" pattern to delimit an item's description, like
/ [0-9]+ \. .*? (?=\s+ [0-9]+ \. | \z) /xg;
(or in a lookahead in split) fails with text like 1. Only $2.50 or 1. Version 2.4.1
...
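To illustrate the version-number case, here is a sketch that runs the lookahead pattern on a hypothetical two-step text whose first description contains 2.4.1:

```perl
use strict;
use warnings;

# Hypothetical sample: two steps, the first one mentions a version number.
my $text = '1. Version 2.4.1 2. Done';

# Each item runs from "number, period" up to the next
# whitespace-number-period sequence, or to the end of the string.
my @parts = $text =~ / [0-9]+ \. .*? (?=\s+ [0-9]+ \. | \z) /xg;

print "$_\n" for @parts;
# Prints three items instead of the intended two:
# 1. Version
# 2.4.1
# 2. Done
```

The " 2." at the start of the version number looks exactly like the start of a new step, so the first item is cut off there.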
‡ To include text like 1. Do "this." and 2. Or "that!" we'd want
/ [0-9]+ \. .*? (?: \." | !" | [.!?]) /xg;
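A quick check of the quote-aware terminators on the two sample items from this footnote, joined into one string (the joining with a single space is an assumption):

```perl
use strict;
use warnings;
use feature 'say';

my $text = '1. Do "this." 2. Or "that!"';

# Try the two-character endings (." and !") before the bare punctuation,
# so that the closing quote is kept with its item.
my @s = $text =~ / [0-9]+ \. .*? (?: \." | !" | [.!?]) /xg;

say for @s;
```

The order of the alternatives matters: with [.!?] listed first, each match would end at the punctuation mark and leave the closing quote behind.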