How can I escape a string in Perl for LDAP searching?

I want to escape a string, per RFC 4515. So, the string "u1" would be transformed to "\75\31", that is, the ordinal value of each character, in hex, preceded by backslash.

Has to be done in Perl. I already know how to do it in Python, C , Java, etc., but Perl is baffling.

Also, I cannot use Net::LDAP and I may not be able to add any new modules, so, I want to do it with basic Perl features.

CodePudding user response：

In short, one way is just as the question says: split the string into characters, get their ordinals, convert those to hex, then stick them back together. I don't know of a built-in way to get the \nn format, so I'd build it 'by hand', zero-padding to two hex digits so that bytes below 0x10 still come out as \nn. For instance:

my $s = join '', map { sprintf '\%02x', ord } split //, 'u1';

Or use the vector flag %v to treat the string as a "vector" of integers, one per character, joined by the string given as the first argument:

my $s = sprintf '\%0*v2x', '\\', 'u1';

This can also be done with pack/unpack; see the link below. Also see that page if there is a wide range of input characters.^†

See ord and sprintf for details.
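The pack/unpack route mentioned above could look roughly like this. It is only a sketch: the helper name escape_all_bytes is made up for illustration, and it assumes the input is already a string of bytes (plain ASCII, or text that has been encoded beforehand).

```perl
use strict;
use warnings;

# Hypothetical helper name. Assumes $bytes holds octets only
# (plain ASCII, or text that was already encoded to bytes).
sub escape_all_bytes {
    my ($bytes) = @_;
    # '(H2)*' yields one two-digit lowercase hex string per byte
    return join '', map { "\\$_" } unpack '(H2)*', $bytes;
}

print escape_all_bytes('u1'), "\n";    # prints \75\31
```

Since H2 always produces two hex digits, a byte such as a newline comes out as \0a rather than \a.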

^† If the input may contain non-ASCII characters, you may need to encode it first to get octets, so that the individual octets are escaped and not whole code points:

use Encode qw(encode);

my $s = sprintf '\%0*v2x', '\\', encode('UTF-8', $input);

See the linked page for more.
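To make the code point vs. octet difference concrete, here is a small sketch using a single non-ASCII character (U+010D, chosen only as an example). Encode ships with Perl itself, so it should not count as an added module, though that is worth checking on the target system.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $input = "\x{10D}";    # one character, U+010D (example value)

# Without encoding, %vx sees a single code point
my $by_codepoint = sprintf '\%vx', $input;

# After encoding to UTF-8, the vector flag sees two octets
my $by_octet = sprintf '\%0*v2x', '\\', encode('UTF-8', $input);

print "$by_codepoint\n";    # prints \10d
print "$by_octet\n";        # prints \c4\8d
```

Only the second form is a sequence of two-digit byte escapes, which is what an LDAP filter needs.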

CodePudding user response：

From skimming through RFC 4515, this encoding escapes the individual octets of multi-byte UTF-8 characters, not code points. So, here is something that works with non-ASCII text too:

#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;

sub valueencode ($) {
    # Unpack format returns octets of UTF-8 encoded text
    my @bytes = unpack "U0C*", $_[0];
    sprintf '\\%02x' x @bytes, @bytes;
}

say valueencode 'u1';
say valueencode "Lu\N{U+010D}i\N{U+0107}"; # Lučić, from the RFC 4515 examples

Example:

$ perl demo.pl
\75\31
\4c\75\c4\8d\69\c4\87

Or an alternative using the vector flag:

use Encode qw/encode/;

sub valueencode ($) {
    sprintf '\%0*v2x', "\\", encode('UTF-8', $_[0]);
}

Finally, a smarter version that escapes ASCII characters only when it has to. It still escapes multi-byte characters, even though, on a closer read of the RFC, they don't actually need to be escaped if they're valid UTF-8:

# Encode according to RFC 4515 valueencoding grammar rules:
#
# Text is UTF-8 encoded. Bytes can be escaped with the sequence
# \XX, where the X's are hex digits.
#
# The characters NUL, LPAREN, RPAREN, ASTERISK and BACKSLASH all MUST
# be escaped.
#
# Bytes > 0x7F that aren't part of a valid UTF-8 sequence MUST be
# escaped. This version assumes there are no such bytes and that input
# is an ASCII or Unicode string.
#
# Single bytes and valid multibyte UTF-8 sequences CAN be escaped,
# with each byte escaped separately. This version escapes multibyte
# sequences, to give ASCII results.
sub valueencode ($) {
    my $encoded = "";
    for my $byte (unpack 'U0C*', $_[0]) {
        if (($byte >= 0x01 && $byte <= 0x27) ||
            ($byte >= 0x2B && $byte <= 0x5B) ||
            ($byte >= 0x5D && $byte <= 0x7F)) {
            $encoded .= chr $byte;
        } else {
            $encoded .= sprintf '\\%02x', $byte;
        }
    }
    return $encoded;
}

This version returns the strings 'u1' and 'Lu\c4\8di\c4\87' from the above examples.
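If escaping multi-byte characters isn't wanted, a lighter variant can escape only the five characters the RFC says MUST be escaped and leave everything else alone. This is a rough sketch, not a drop-in replacement: the name escape_minimal is made up, and it assumes the input is a character string that the caller encodes as UTF-8 before it goes on the wire.

```perl
use strict;
use warnings;

# Hypothetical helper: escapes only NUL, '(', ')', '*' and '\'.
# Assumes $value is a character string that the caller will
# encode as UTF-8 before sending it to the server.
sub escape_minimal {
    my ($value) = @_;
    # Replace each special character with \XX (two hex digits)
    (my $escaped = $value) =~ s/([\x00()*\\])/sprintf '\\%02x', ord $1/ge;
    return $escaped;
}

print escape_minimal('a*(b)\\c'), "\n";    # prints a\2a\28b\29\5cc
```

Non-ASCII characters pass through untouched here, so the result is no longer pure ASCII.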