Explaining perl split with backslash in string pattern

Time:11-06

I'm the author of pythonizer, a Perl-to-Python converter, and I'm trying to translate a Perl split statement whose pattern is a string that includes a backslash. I need some help understanding the behavior. Here is an example based on the source code I'm trying to translate:

$s = 'a|b|c';
@a = split '\|', $s;
print scalar(@a) . "\n";
print "@a\n";

The output is:

3
a b c

Now if I just print '\|' it prints \| so I'm not sure why the backslash is being ignored in the string pattern. The documentation doesn't say much of anything about a string being used as a pattern, except for the ' ' special case. Feeding '\|' to python string split will not split this string.

Even more strange is what happens if I change the above code to use a double-quoted string:

@a = split "\|", $s;

Then the output is:

5
a | b | c

If I change it to a regex, then it does the same thing as if it was a single-quoted string (splitting into 3 pieces), which makes perfect sense because | is a special char in a regex so it needs to be escaped:

@a = split /\|/, $s;

So my question is - how is a split on a string that contains a backslash (in single and then double quotes) supposed to work so I can reproduce it in python? Should I just remove all backslashes, except for \\ from a single-quoted input string if it's on a split?

Also, why does a split on "\|" (or "|") split the string into 5 pieces? (I'm thinking of punting on this case.)

CodePudding user response:

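Perl takes a single-quoted string almost as-is (only \\ and \' are processed inside it). A double-quoted string is interpolated, and its backslash escapes are processed.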

split expects a regex, so the '\|' is treated as the regex \|, where the \ is the regex escape character, meaning a literal | is the separator that gets matched. With "\|", Perl processes the double-quoted string down to just |, which in a regex is the OR (alternation) operator.
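For the translation side, the single-quoted case can be sketched in Python roughly as follows. This is a minimal sketch, not pythonizer's actual logic: the helper name `perl_single_quoted_value` is hypothetical, and it assumes the converter already has the raw source text between the quotes.

```python
import re

def perl_single_quoted_value(body):
    # `body` is the raw source text between the single quotes (an assumption
    # about what the converter has in hand). In a Perl single-quoted literal
    # only a backslash followed by a backslash or a quote is an escape;
    # every other backslash is kept as-is.
    return re.sub(r"\\([\\'])", r"\1", body)

pattern = perl_single_quoted_value(r"\|")   # still the two characters \|
print(re.split(pattern, "a|b|c"))           # ['a', 'b', 'c']
print("a|b|c".split(pattern))               # ['a|b|c'] (str.split is not regex-based)
```

So the backslash should not be stripped from the single-quoted string: it has to reach the regex engine, where it escapes the |.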

CodePudding user response:

There are a few interleaved questions, so let me go step by step.

  • Perl's split takes, as its first argument, a regular expression pattern that identifies the separators by which it splits the string. This is a "normal" regex, compiled and run by the regex engine, although split does treat a few patterns as special cases.

  • As for the delimiters around split's pattern: variables in the pattern are interpolated unless the delimiters are single quotes, just as in a regex, but that has no bearing on the examples here. The string '\|' gives the pattern \| either way, which matches a literal | (and is not alternation).

    But for double quotes there is a difference: in split, the string under the double quotes is first interpolated, apparently including its escapes, and only then is the result handed to the regex engine to be compiled into a pattern. So "\|" becomes the pattern | for the regex. (This differs from a pattern written between regex delimiters, such as /\|/, where the backslash reaches the regex engine.)

  • That brings us to splitting on the pattern |, written as split "\|" or as split /|/: it works like splitting on an empty pattern, a special case of split that returns all the characters. An ordinary regex match doesn't behave that way, neither with /|/ nor with //.

    This behavior of split for a lone | does not appear to be documented explicitly. I can see a rationale like "split by either the empty string or the empty string, so split by the empty string", which perhaps makes some sense for split.

    In a regex match that doesn't make any sense, as a pattern that is an empty string has very distinct behavior, unrelated to what split does. Thus matching "empty string -or- empty string" doesn't make sense, and so neither does a pattern consisting of a lone |; we get silly results. (I wonder why it even compiles?)
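As a cross-check on the Python side, a small sketch of what the `re` module does with the same two patterns. It only lists where each pattern matches in the example string; nothing here is specific to Perl.

```python
import re

text = "a|b|c"

# A lone | is an alternation of two empty branches, so it matches the
# empty string at every position, including the very start and end.
print([m.span() for m in re.finditer("|", text)])
# [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]

# The escaped form matches only the two literal | characters.
print([m.span() for m in re.finditer(r"\|", text)])
# [(1, 2), (3, 4)]
```

A zero-width match between every pair of characters is what makes a split on | hand back each character as its own field.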

As for what to do with this in Python: the str.split mentioned by OP in a comment doesn't use regex at all. To reproduce Perl's split operation one needs to use split from the re module, re.split(pattern, string, ...). Then go through the details and test how re behaves with escaped regex patterns.
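A rough starting point for such a wrapper might look like the sketch below. The name `perl_split` is hypothetical, and the sketch assumes Python 3.7 or later (where `re.split` splits on empty matches), no LIMIT argument, and none of split's special patterns such as ' '. It only adjusts the two places where the results visibly differ for the examples in the question: Perl drops trailing empty fields, and it does not produce a leading empty field for a zero-width match at the start.

```python
import re

def perl_split(pattern, text):
    # Hypothetical helper: approximates Perl's split PATTERN, EXPR
    # (no LIMIT, no ' ' special case). Assumes Python 3.7+.
    fields = re.split(pattern, text)

    # Python yields a leading '' for a zero-width match at position 0;
    # Perl only yields it for a positive-width match there.
    m = re.match(pattern, text)
    if fields and fields[0] == "" and m is not None and m.end() == 0:
        fields = fields[1:]

    # Perl removes trailing empty fields when no LIMIT is given.
    while fields and fields[-1] == "":
        fields.pop()
    return fields

print(perl_split(r"\|", "a|b|c"))   # ['a', 'b', 'c']
print(perl_split("|", "a|b|c"))     # ['a', '|', 'b', '|', 'c']
```

The two calls reproduce the 3-element and 5-element results from the question; patterns with capture groups or a LIMIT would need more care and are not covered here.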
