Nom parser that skips escaped terminator characters

I've checked the other SO answers for nom parser combinator questions, but this one doesn't seem to have been asked yet.

I am attempting to parse delimited regular expressions. They will always be delimited as /...../, perhaps with modifiers at the end (which are out of scope for all the data I need to parse right now). However, if there is an escaped \/ in the middle of the string, my parser stops prematurely on the first /, even if it was preceded by a \.

I have this parser:

use nom::bytes::complete::{tag, take_until};
use nom::{combinator::map_res, sequence::tuple, IResult};
use regex::Regex;

pub fn regex(input: &str) -> IResult<&str, Regex> {
    map_res(
        tuple((tag("/"), take_until("/"), tag("/"))),
        |(_, re, _)| Regex::new(re),
    )(input)
}

Naturally, take_until stops at the first / without noticing that the previous character was a \. I've looked at peek, recognize, map and a whole bunch of other things, but I'm coming up short. I feel like I want take_until("/") with some kind of awareness of escaping. In any case, I am using map_res to hand off to Rust's regex crate to do the parsing.
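To make the desired behaviour concrete, here is a minimal sketch of an escape-aware search written with the standard library only. It is not a nom combinator, and the name find_unescaped_slash is hypothetical; it only illustrates what "take_until that respects backslashes" would have to compute.

```rust
/// Returns the byte index of the first `/` in `input` that is not escaped
/// by a preceding backslash, or `None` if there is no such slash.
/// Hypothetical helper for illustration; it is not part of nom.
fn find_unescaped_slash(input: &str) -> Option<usize> {
    let mut chars = input.char_indices();
    while let Some((idx, ch)) = chars.next() {
        match ch {
            // A backslash escapes whatever follows it, so skip that character.
            '\\' => {
                chars.next();
            }
            // The first slash reached outside an escape is the terminator.
            '/' => return Some(idx),
            _ => {}
        }
    }
    None
}
```

For the body hello \/world/ (with the opening slash already consumed), a plain find('/') reports index 7, while this helper reports index 13, the position of the real terminator.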

I also tried something like this using the escaped combinator, but the examples are somewhat unclear and I couldn't make it work:

pub fn regex(input: &str) -> IResult<&str, Regex> {
    map_res(
        tuple((
            tag("/"),
            escaped(many1(anychar), '\\', one_of(r"/")),
            tag("/"),
        )),
        |(_, re, _)| {
            println!("mapres {}", re);
            Regex::new(re)
        },
    )(input)
}

My test cases are as follows (the .unwrap().as_str() is only there to keep the example small, since regex::Regex doesn't implement PartialEq):

#[cfg(test)]
mod tests {
    use super::regex;
    use super::Regex;
    #[test]
    fn test_parse_regex_simple() {
        assert_eq!(
            Regex::new(r#"hello world"#).unwrap().as_str(),
            regex("/hello world/").unwrap().1.as_str()
        );
    }
    #[test]
    fn test_parse_regex_with_escaped_forwardslash() {
        assert_eq!(
            Regex::new(r#"hello /world"#).unwrap().as_str(),
            regex(r"/hello \/world/").unwrap().1.as_str(),
        );
    }
}

CodePudding user response：

The parser passed as the first argument to escaped() should match ordinary characters only: it must not consume the escape character, and it must stop before the terminating character(s). many1(anychar) satisfies neither condition, since it consumes every remaining character.

Rather, you should call it this way:

escaped(none_of(r"\/"), '\\', one_of(r"/"))

Or the whole expression:

map_res(
    tuple((
        tag("/"),
        escaped(none_of(r"\/"), '\\', one_of(r"/")),
        tag("/"),
    )),
    |(_, re, _)| Regex::new(re),
)(input)

But that doesn't work, because Regex's escape sequences don't include \/, so you need to strip the escape characters. Luckily, escaped_transform() is here to help you:

map_res(
    tuple((
        tag("/"),
        escaped_transform(none_of(r"\/"), '\\', one_of(r"/")),
        tag("/"),
    )),
    |(_, re, _)| Regex::new(&re), // The `&` is needed because `escaped_transform()` returns a `String` but `Regex::new()` wants `&str`
)(input)
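To see what the transform step has to produce, independent of nom, the following standard-library sketch rewrites \/ to / in an already extracted body. The function name strip_escaped_slashes is hypothetical. Unlike the one_of(r"/") version above, which rejects any other escape, this sketch passes other backslash sequences through unchanged; that is an assumption made for illustration.

```rust
/// Rewrites `\/` to `/` and leaves every other character, including other
/// backslash escapes, untouched. Hypothetical helper, not a nom API.
fn strip_escaped_slashes(body: &str) -> String {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(ch) = chars.next() {
        if ch != '\\' {
            out.push(ch);
            continue;
        }
        // `ch` is a backslash: decide based on the character it escapes.
        match chars.next() {
            // Escaped delimiter: drop the backslash.
            Some('/') => out.push('/'),
            // Any other escape: keep both characters as written.
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            // Dangling backslash at the end of the input: keep it as-is.
            None => out.push('\\'),
        }
    }
    out
}
```

With this, hello \/world becomes hello /world, which is the pattern text the second test case expects to reach Regex::new().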

CodePudding user response：

The accepted answer from Chayim Friedman is correct. However, I was able to extend it to also handle \w, \d and other such escape sequences. It is simply an extension of Chayim's idea in the escaped_transform version:


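use nom::branch::alt;
use nom::bytes::complete::{escaped_transform, tag};
use nom::character::complete::none_of;
use nom::combinator::{map_res, value};
use nom::sequence::delimited;
use nom::IResult;
use regex::Regex;

pub fn regex(input: &str) -> IResult<&str, Regex> {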
    map_res(
        delimited(
            tag("/"),
            escaped_transform(
                none_of("\\/"),
                '\\',
                alt((
                    value(r"/", tag("/")),
                    value(r"\d", tag("d")),
                    value(r"\W", tag("W")),
                    value(r"\w", tag("w")),
                    value(r"\b", tag("b")),
                    value(r"\B", tag("B")),
                )),
            ),
            tag("/"),
        ),
        |re| Regex::new(&re),
    )(input)
}

Note that this list is incomplete; https://docs.rs/regex/1.5.6/regex/#escape-sequences gives the complete set of escapes, and https://github.com/Geal/nom/blob/main/examples/string.rs gives a more detailed explanation of how to handle \u{....}-type escape sequences.
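As a rough, nom-free model of the escape table in the answer above, the sketch below uses a match expression in place of the alt((value(..), ..)) list. Both function names (translate_escape and parse_regex_literal) are hypothetical, and the code returns the pattern text rather than a compiled Regex so that it depends only on the standard library. It is meant to show how the table grows, not to reproduce nom's exact error behaviour.

```rust
/// Maps the character after a backslash to the text that should reach the
/// regex engine. Hypothetical helper mirroring the answer's escape table.
fn translate_escape(c: char) -> Option<&'static str> {
    match c {
        '/' => Some("/"), // the delimiter loses its backslash
        'd' => Some(r"\d"),
        'W' => Some(r"\W"),
        'w' => Some(r"\w"),
        'b' => Some(r"\b"),
        'B' => Some(r"\B"),
        _ => None, // escape not in the table: treated as a parse failure
    }
}

/// Parses `/body/` and returns (input after the closing slash, pattern text).
/// Returns `None` on a missing delimiter or an escape that is not in the table.
fn parse_regex_literal(input: &str) -> Option<(&str, String)> {
    let body = input.strip_prefix('/')?;
    let mut pattern = String::new();
    let mut chars = body.char_indices();
    while let Some((idx, ch)) = chars.next() {
        match ch {
            // Unescaped slash: the literal ends here.
            '/' => return Some((&body[idx + 1..], pattern)),
            // Backslash: translate the escaped character through the table.
            '\\' => {
                let (_, escaped) = chars.next()?;
                pattern.push_str(translate_escape(escaped)?);
            }
            other => pattern.push(other),
        }
    }
    None // ran out of input without a closing slash
}
```

Supporting another escape then means adding one arm to translate_escape, in the same way the nom version gains one more value(...) entry.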