I'm the author of pythonizer, a Perl-to-Python converter, and I'm trying to translate a Perl split statement whose string pattern includes a backslash. I need some help understanding the behavior. Here is an example based on the source code I'm trying to translate:
$s = 'a|b|c';
@a = split '\|', $s;
print scalar(@a) . "\n";
print "@a\n";
The output is:
3
a b c
Now if I just print '\|', it prints \|, so I'm not sure why the backslash is being ignored in the string pattern. The documentation doesn't say much of anything about a string being used as a pattern, except for the ' ' special case. Feeding '\|' to Python's string split will not split this string.
Even more strange is what happens if I change the above code to use a double-quoted string:
@a = split "\|", $s;
Then the output is:
5
a | b | c
If I change it to a regex, then it does the same thing as with the single-quoted string (splitting into 3 pieces), which makes perfect sense because | is a special character in a regex, so it needs to be escaped:
@a = split /\|/, $s;
So my question is: how is a split on a string that contains a backslash (in single and then double quotes) supposed to work, so I can reproduce it in Python? Should I just remove all backslashes, except for \\, from a single-quoted input string if it's used in a split?
Also, why does a split on "\|" (or "|") split the string into 5 pieces? (I'm thinking of punting on this case.)
CodePudding user response:
Perl takes a single-quoted string as-is (apart from \\ and \') and interpolates a double-quoted one.
split expects a regex, so '\|' is treated as the regex \|, where \ is the regex escape character, meaning the literal | is the separator that gets matched. Perl interpolates "\|" to just |, which in a regex means OR (alternation).
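To see the same distinction from the Python side, here is a minimal sketch (the variable names are just for illustration). It contrasts a plain string split, which looks for the two-character separator backslash-pipe, with a regex split on the escaped pipe:

```python
import re

s = "a|b|c"

# Plain string split: the separator is the two characters backslash + pipe,
# which never occur together in s, so nothing is split.
plain = s.split("\\|")

# Regex split: the backslash escapes the pipe, so the pattern matches a literal "|".
escaped = re.split(r"\|", s)

print(plain)    # ['a|b|c']
print(escaped)  # ['a', 'b', 'c']
```

So the single-quoted Perl pattern corresponds to the regex form, not to str.split.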
CodePudding user response:
There are a few interleaved questions, so let me go step by step.
Perl's split takes a regular expression pattern as its first argument and uses it to identify the separators by which it splits the string. This is a "normal" regex, compiled and run by the regex engine, but split does treat some cases specially.
As for delimiters in split's regex: variables in patterns are interpolated except under single quotes, as in a regex, which has no relevance for the examples here. The string \| is the pattern \| either way, so it matches the literal | (and not alternation).
But for double quotes there is a difference: in split the string under the double quotes is first interpolated, apparently including escapes, and only then is the result handed to the regex engine to compile it into a pattern. So "\|" becomes the pattern | for the regex. (This is not the behavior in a regex outside of split.)
Which brings us to the issue of splitting with the pattern |, as split "\|" or as split /|/. That works like splitting with an empty string, a split specialty which returns all characters. A regex doesn't behave that way, neither with /|/ nor with //.
This behavior of split appears undocumented. I can see a rationale like "split by either empty string or by empty string -- well, so split by empty string", which for split perhaps makes some sense.
In a regex that doesn't make any sense, as a pattern of an empty string has very distinct behavior, unrelated to what split does. Thus matching "empty string -or- empty string" doesn't make sense, and so the pattern of a lone | doesn't either; we get silly results. (I wonder why it even compiles?)
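For comparison, this is what Python's re does with a lone | as the pattern. It is a small sketch and assumes Python 3.7 or later, where re.split also splits on empty matches. Both branches of the alternation are empty, so the pattern matches the empty string at every position:

```python
import re

s = "a|b|c"

# A bare "|" compiles to "empty OR empty", which matches the empty string
# at each of the six positions in the five-character string.
matches = re.findall("|", s)
print(len(matches))  # 6

# Splitting on those empty matches yields every character, plus an empty
# field at each end of the string.
raw = re.split("|", s)
print(raw)  # ['', 'a', '|', 'b', '|', 'c', '']
```

Python keeps the empty fields at both ends, which is why it reports 7 fields where the Perl code in the question printed 5.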
As for what to do with this for Python: the str.split mentioned by the OP in a comment doesn't use regex at all. To reproduce Perl's split operation one needs to use split from re, re.split(pattern, string, ...). Then go through the details and test the behavior in re with escaped regex patterns.
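As a starting point, here is a rough sketch of such a wrapper around re.split. The name perl_split is hypothetical, and the rules it encodes are my reading of the split documentation: no fields for an empty input string, no leading empty field after a zero-width match at the start, and trailing empty fields dropped when no LIMIT is given. It does not handle the ' ' special case, LIMIT, capture groups, or differences between Perl and Python regex syntax, and it assumes Python 3.7+.

```python
import re

def perl_split(pattern, text):
    """Hypothetical helper: approximate Perl's split PATTERN, EXPR (no LIMIT)."""
    if text == "":
        return []  # splitting an empty string produces zero fields
    regex = re.compile(pattern)
    fields = regex.split(text)  # Python 3.7+ also splits on empty matches
    first = regex.search(text)  # the first separator that re.split found
    if first is not None and first.start() == 0 and first.end() == 0:
        fields = fields[1:]  # zero-width match at the start: no leading empty field
    while fields and fields[-1] == "":
        fields.pop()  # drop trailing empty fields
    return fields

print(perl_split(r"\|", "a|b|c"))  # ['a', 'b', 'c']
print(perl_split("|", "a|b|c"))    # ['a', '|', 'b', '|', 'c']
```

With this, the single-quoted and regex forms from the question give 3 pieces and the double-quoted form gives 5, matching the Perl output shown above. Anything beyond those cases would need to be tested against real Perl.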