Perl regex: how to get multi-byte optional char (?) matching

Time:06-17

With single-byte chars optional matching works:

~% perl -e 'print ("l" =~ /l?/u)'            
1%                                           
~% perl -e 'print ("l" =~ /l?l?/u)'          
1%                                           

With Unicode (multi-byte) chars, optional matching does not work:

~% perl -e 'print ("д" =~ /д?/)'   
1%                                 
~% perl -e 'print ("д" =~ /д?д?/u)'
~%                                 

How can I make it work? I've already added /u, and I've tried use feature 'unicode_strings', to no avail. I assume Perl sees д as multiple bytes and applies ? only to the last one.

CodePudding user response:

You have to tell Perl that the source is in UTF-8:

perl -Mutf8 -e 'print "д" =~ /д?д?/u'

See the utf8 pragma documentation (perldoc utf8) for details.
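
In a script file, the same fix is a use utf8; line near the top. The following is only a minimal sketch, assuming the file itself is saved as UTF-8; the variable names are illustrative and not from the question.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

my $text = "д";

# With the pragma in effect the literal is one character, so each "?"
# in the pattern applies to a whole "д" rather than to a single byte.
my $literal_length = length $text;
my $matched        = $text =~ /^д?д?$/ ? 1 : 0;

# Only ASCII is printed, so no output encoding layer is needed here.
print "length=$literal_length matched=$matched\n";
```

Saved as UTF-8 and run, this should print length=1 matched=1. Printing д itself would additionally need an encoding layer on STDOUT, which is left out of this sketch.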

CodePudding user response:

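By default, Perl expects the source code it is given to be ASCII. (String literals are 8-bit clean, meaning non-ASCII bytes are included as-is.) use utf8; tells it to expect UTF-8 instead. Without it, д is the two bytes D0 B4 and each ? applies only to the preceding byte B4, so /д?д?/ requires two D0 bytes and cannot match a single д.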

$ perl -le'print "д" =~ /д?д?/u'


$ perl -le'print "\xD0\xB4" =~ /\xD0\xB4?\xD0\xB4?/u'         # Same as previous


$ perl -le'use utf8; print "д" =~ /д?д?/u'
1

$ perl -le'use utf8; print "\x{434}" =~ /\x{434}?\x{434}?/u'  # Same as previous
1

$ perl -Mutf8 -le'print "д" =~ /д?д?/u'                       # -Mutf8 == use utf8;
1
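
To see why the byte form fails, it may help to compare what the regex engine receives in each case. This sketch uses only escapes, so it does not depend on how the file is encoded; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# What Perl sees for the literal "д" without "use utf8": two bytes.
my $bytes = "\xD0\xB4";

# What it sees with "use utf8": one character, U+0434.
my $char = "\x{434}";

my $byte_length = length $bytes;
my $char_length = length $char;

# Byte pattern: D0, optional B4, D0, optional B4. The second D0 is
# mandatory, so a lone "д" (D0 B4) cannot match.
my $bytes_match = $bytes =~ /\xD0\xB4?\xD0\xB4?/ ? 1 : 0;

# Character pattern: optional д, optional д.
my $char_match = $char =~ /\x{434}?\x{434}?/ ? 1 : 0;

# Decoding the bytes as UTF-8 yields the same single character.
# utf8::decode is available without loading the utf8 pragma.
my $decoded = $bytes;
utf8::decode($decoded);
my $same = $decoded eq $char ? 1 : 0;

print "bytes: length=$byte_length match=$bytes_match\n";
print "chars: length=$char_length match=$char_match\n";
print "decoded bytes equal U+0434: $same\n";
```

This prints length=2 match=0 for the byte string, length=1 match=1 for the character string, and 1 for the final comparison.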