How to split text into "steps" using regex in perl?

Time:11-06

I am trying to split text into "steps". Let's say my text is

my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";

I'd like the output to be:

"1.Do this."
"2.Then do that."
"3.And then maybe that."
"4.Complete!"

I'm not really that good with regex, so help would be great!

I've tried many combinations like:

split /(\s\d.)/ 

But it splits the numbering away from the text.

CodePudding user response:

I would indeed use split. But you need to exclude the digit from the match by using a lookahead.

my @steps = split /\s+(?=\d+\.)/, $steps;
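
For anyone who wants to try it, here is a minimal runnable sketch of the lookahead split, using the sample string from the question. The print formatting is just my choice for showing the quotes; it is not part of the answer.

```perl
use strict;
use warnings;

my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";

# Split on whitespace only where it is followed by "digits + period".
# The lookahead (?=...) checks for the number without consuming it,
# so the numbering stays attached to each step.
my @steps = split /\s+(?=\d+\.)/, $steps;

print qq("$_"\n) for @steps;
```

On the sample string this should give the four quoted steps, one per line, with each number still in front of its text.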

CodePudding user response:

All step descriptions start with a number followed by a period and then have non-numbers, until the next number. So capture all such patterns:

my @s = $steps =~ / [0-9]+ \. [^0-9]+ /xg;

say for @s;

This works only if the step descriptions are certain to contain no numbers, and the same goes for any approach that relies on matching a number (even one followed by a period, because of decimal numbers).

If there may be numbers in there, we'd need to know more about the structure of the text.
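
A runnable sketch of the capture approach on the sample string, assuming no digits inside the descriptions. One detail worth knowing: a "one or more non-digits" class also matches the space before the next number, so the sketch trims trailing whitespace afterwards. That trimming step is my addition.

```perl
use strict;
use warnings;
use feature 'say';

my $steps = "1.Do this. 2.Then do that. 3.And then maybe that. 4.Complete!";

# Number, period, then everything up to (not including) the next digit.
# In list context with /g and no capture groups, we get all full matches.
my @s = $steps =~ / [0-9]+ \. [^0-9]+ /xg;

# [^0-9]+ also matched the space before the next number, so trim it.
s/\s+\z// for @s;

say qq("$_") for @s;
```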

Another delimiting pattern to consider is punctuation that ends a sentence (. and ! in these examples), provided that no such characters appear inside a step's description and that no step has multiple sentences:

my @s = $steps =~ / [0-9]+ \. .*? [.!] /xg;

Augment the list of patterns that end an item's description as needed, say with a ?, and/or a ." sequence, since punctuation often goes inside quotes.

If this doesn't fit the text either (it may have multiple sentences?), then we'd really need a more precise description of what that text is like.


An approach using a "number-period" pattern to delimit an item's description, like

/ [0-9]+ \. .*? (?=\s+ [0-9]+ \. | \z) /xg;

(or the same pattern in a lookahead in split) fails with text like

1. Only $2.50   or   1. Version 2.4.1   ...
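
To see the failure concretely, here is a small sketch with a made-up two-step input (the text is hypothetical, chosen only to contain a version number in a description), run through a "whitespace before number-period" lookahead split.

```perl
use strict;
use warnings;

# Hypothetical input: two steps, the first mentions a version number.
my $text = "1. Install Version 2.4.1 first. 2. Then restart.";

# Same idea as the lookahead split: break before "digits + period".
my @parts = split /\s+(?=[0-9]+\.)/, $text;

print qq("$_"\n) for @parts;
```

This yields three pieces rather than two, because the " 2." at the start of the version number looks exactly like the start of a new step:

"1. Install Version"
"2.4.1 first."
"2. Then restart."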


To include text like 1. Do "this." and 2. Or "that!" we'd want

/ [0-9]+ \. .*? (?: \." | !" | [.!?] ) /xg;
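
A quick sketch of that last pattern in use. The input string is hypothetical, put together from the two example items plus a plain third one. The two-character alternatives come first in the alternation so that a closing quote is taken along with the punctuation instead of being left behind.

```perl
use strict;
use warnings;

# Hypothetical input built from the example items, plus a plain one.
my $text = q(1. Do "this." 2. Or "that!" 3. Done.);

# Number, period, then lazily up to the first item terminator:
# ." or !" (punctuation inside quotes), else a bare . ! or ?
my @items = $text =~ / [0-9]+ \. .*? (?: \." | !" | [.!?] ) /xg;

print "$_\n" for @items;
```

As with the other punctuation-based patterns, this still assumes one sentence per item and no stray periods (decimals, abbreviations) inside a description.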