perl regex to remove initial all-whitespace lines from a string: why does it work?

Time:10-14

The regex s/\A\s*\n// removes every all-whitespace line from the beginning of a string. It leaves everything else alone, including any whitespace that might begin the first visible line. By "visible line," I mean a line that satisfies /\S/. The code below demonstrates this.

But how does it work?

\A anchors the start of the string

\s* greedily grabs all whitespace. But without the (?s) modifier, it should stop at the end of the first line, should it not? See https://perldoc.perl.org/perlre.

Suppose that without the (?s) modifier it nevertheless "treats the string as a single line". Then I would expect the greedy \s* to grab every whitespace character it sees, including linefeeds. So it would pass the linefeed that precedes the "dogs" string, keep grabbing whitespace, run into the "d", and we would never get a match.

Nevertheless, the code does exactly what I want. Since I can't explain it, it's like a kludge, something that happens to work, discovered through trial and error. What is the reason it works?

#!/usr/bin/env perl 
use strict; use warnings;
print $^V; print "\n";

my @strs=(
    join('',"\n", "\t", ' ', "\n", "\t", ' dogs',),
    join('',
              "\n",
              "\n\t\t\x20",
              "\n\t\t\x20",
    '......so what?',
              "\n\t\t\x20",
    ),
);

my $count=0;
for my $onestring(@strs)
{
    $count++;
    print "\n$count ------------------------------------------\n"; 
    print "|$onestring|\n";
    (my $try1=$onestring)=~s/\A\s*\n//;
    print "|$try1|\n";
}

CodePudding user response:

But how does it work?
...
I would expect the greedy \s* to grab every whitespace character it sees, including linefeeds. So it would pass the linefeed that precedes the "dogs" string, keep grabbing whitespace, run into the "d", and we would never get a match.

Correct -- the \s* at first grabs everything up to the d (in dogs), and with that the match would fail. So it backs up, a character at a time, shortening that greedy grab to give the following pattern, here \n, a chance to match.

And that works! So \s* matches up to (the last!) \n, that one is matched by the following \n in the pattern, and all is well. That's removed and we are left with "\t dogs", which is printed.

This is called backtracking; perlretut covers it as well. Backtracking can be suppressed, most notably by possessive quantifiers (such as \s*+) or by the extended construct (?>...).
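
To see that backtracking is what makes the match succeed, here is a minimal sketch that runs the first test string from the question through the ordinary greedy quantifier, the possessive form \s*+, and the atomic group (?>\s*). The variable names are illustrative only, and the possessive syntax assumes Perl 5.10 or later.

```perl
use strict; use warnings;

# First test string from the question: two all-whitespace lines, then "\t dogs".
my $str = "\n\t \n\t dogs";

# Ordinary greedy quantifier: \s* gives characters back until the literal \n can match.
(my $greedy = $str) =~ s/\A\s*\n//;

# Possessive quantifier: \s*+ keeps everything up to the "d", so \n can never match.
my $possessive_matched = ($str =~ /\A\s*+\n/) ? 1 : 0;

# Atomic group: behaves like the possessive form here.
my $atomic_matched = ($str =~ /\A(?>\s*)\n/) ? 1 : 0;

printf "greedy left \"\\t dogs\": %s\n", $greedy eq "\t dogs" ? "yes" : "no";
printf "possessive matched: %d, atomic matched: %d\n",
    $possessive_matched, $atomic_matched;
```

With backtracking disabled, the pattern has no way to hand the last linefeed over to the trailing \n, so neither variant matches at all.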


But without the (?s) modifier, it should stop at the end of the first line, should it not?

Here you may be confusing \s with ., which indeed does not match \n (without /s).
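
A quick way to check this is to test a lone linefeed against each construct, with and without /s. This is only a sketch; the variable names are made up for illustration.

```perl
use strict; use warnings;

my $lf = "\n";

# \s is a character class that includes LF, regardless of /s.
my $s_plain  = ($lf =~ /\A\s\z/)  ? 1 : 0;
my $s_with_s = ($lf =~ /\A\s\z/s) ? 1 : 0;

# . skips LF by default and matches it only under /s.
my $dot_plain  = ($lf =~ /\A.\z/)  ? 1 : 0;
my $dot_with_s = ($lf =~ /\A.\z/s) ? 1 : 0;

print "\\s without /s: $s_plain, \\s with /s: $s_with_s\n";
print ".  without /s: $dot_plain, .  with /s: $dot_with_s\n";
```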

CodePudding user response:

There are two questions here.


The first is about the interaction of \s and (lack of) (?s). Quite simply, there is no interaction.

\s matches whitespace characters, which include Line Feed (LF). It's not affected by (?s) whatsoever.

(?s) exclusively affects the . metacharacter:

  • (?-s) causes . to match all characters except LF. [Default]
  • (?s) causes . to match all characters.

If one wanted to match whitespace on the current line, one could use \h instead of \s. It only matches horizontal whitespace, thus excluding CR and LF (among others).

Alternatively, (?[ \s - \n ])[1], [^\S\n][2] and \s(?<!\n)[3] all match whitespace characters other than LF.
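
As a rough illustration of the difference, the sketch below applies \h and [^\S\n] to the first test string from the question. Because neither can cross a linefeed, a single \A\h*\n strips only one line; the grouped form (?:\h*\n)+ is one possible way (an assumption of this sketch, not something from the question) to strip all leading blank lines without relying on \s. \h assumes Perl 5.10 or later.

```perl
use strict; use warnings;

my $str = "\n\t \n\t dogs";

# \h cannot cross a linefeed, so only the first (empty) line is removed.
(my $h_once = $str) =~ s/\A\h*\n//;

# [^\S\n] is "whitespace other than LF" and behaves the same way here.
(my $neg_once = $str) =~ s/\A[^\S\n]*\n//;

# Repeating the per-line pattern removes every leading all-whitespace line.
(my $h_all = $str) =~ s/\A(?:\h*\n)+//;

print "one line removed with \\h:       ", ($h_once   eq "\t \n\t dogs" ? "yes" : "no"), "\n";
print "one line removed with [^\\S\\n]: ", ($neg_once eq "\t \n\t dogs" ? "yes" : "no"), "\n";
print "all blank lines removed:        ", ($h_all    eq "\t dogs"      ? "yes" : "no"), "\n";
```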


The second is about a misconception of what greediness means.

Greediness or lack thereof doesn't affect whether a pattern can match, just what it matches. For example, for a given input, /a+/ and /a+?/ will both match, or neither will match. It's impossible for one to match and not the other.

"aaaa" =~ /a+/    # Matches 4 characters at position 0.
"aaaa" =~ /a+?/   # Matches 1 character  at position 0.

"bbbb" =~ /a+/    # Doesn't match.
"bbbb" =~ /a+?/   # Doesn't match.

When something is greedy, it means it will match the most possible at the current position that allows the entire pattern to match. Take the following for example:

"ccccd" =~ /.*d/

This pattern can match by having .* match only cccc instead of ccccd, and thus does so. This is achieved through backtracking. .* initially matches ccccd, then it discovers that d doesn't match, so .* tries matching only cccc. This allows the d and thus the entire pattern to match.

You'll find backtracking used outside of greediness too. "efg" =~ /^(e|.f)g/ matches because it tries the second alternative when it's unable to match g when using the first alternative.

In the same way as .* avoids matching the d in the earlier example, the \s* avoids matching the last LF, tab and space before dogs in your example.
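
The points above can be checked by capturing what each sub-pattern ends up holding. This is a small sketch: the capture groups and variable names are added for illustration, and it uses a+ and a+? as the greedy and non-greedy quantifiers.

```perl
use strict; use warnings;

# Greedy vs. non-greedy: both match, they just capture different amounts.
my ($greedy)     = "aaaa" =~ /(a+)/;
my ($non_greedy) = "aaaa" =~ /(a+?)/;

# .* first takes "ccccd", then gives back the "d" so the literal d can match.
my ($dot_star) = "ccccd" =~ /(.*)d/;

# Alternation backtracks too: "e" fails to be followed by g, so ".f" is tried.
my ($alt) = "efg" =~ /^(e|.f)g/;

# The question's pattern: \s* gives back the last LF, tab and space.
my ($ws) = "\n\t \n\t dogs" =~ /\A(\s*)\n/;

printf "greedy: %d chars, non-greedy: %d char\n", length $greedy, length $non_greedy;
printf ".* captured: %s\n", $dot_star;
printf "alternation captured: %s\n", $alt;
printf "\\s* captured %d whitespace characters\n", length $ws;
```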


  1. Currently requires use experimental qw( regex_sets );, but I consider it safe.
  2. Less clear because it uses double negatives.
    [^\S\n]
    = A char that's ( not( not(\s) or LF ) )
    = A char that's ( not(not(\s)) and not(LF) )
    = A char that's ( \s and not LF )
  3. Less efficient, and not nearly as pretty as the regex set.