Idiomatic rust rewrite of Rob Pikes regex-CodePudding

In the book The Practice of Programming, there's a short snippet of C code that implements a regex matcher. Brian W Kernighan has expanded on that chapter and published it online at https://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html

Robs C code relies on the '\0' sentinels for length matching, but that won't work in Rust. So my port of the code ends up with lots of length checks.

I use slices to byte arrays to simplify the task a bit, instead of supporting utf-8 encoded strings.

The implementation consists of three functions re_match, match_here and match_star. I use ranges for iterating over the slices.

// Search for regex anywhere in the text.
fn re_match(regex: &[u8], text: &[u8]) -> bool {
    if regex.len() > 0 && regex[0] == b'^' {
        return match_here(&regex[1..], &text);
    }
    // We need to check even if text is empty
    if match_here(regex, text) {
        return true
    }
    for i in 0..text.len() {
        if match_here(regex, &text[i..]) {
            return true
        }
    }
    false 
}

// Search for regex at beginning of text
fn match_here(regex: &[u8], text: &[u8]) -> bool {
    if regex.len() == 0 {
        return true;
    }
    if regex.len() > 1 && regex[1] == b'*' {
        return match_star(regex[0], &regex[2..], text);
    }

    if regex.len() == 1 && regex[0] == b'$' {
        return text.len() == 0;
    }

    if text.len() > 0 && (regex[0] == b'.' || regex[0] == text[0]) {
        return match_here(&regex[1..], &text[1..]);
    }
    false
}

// Search for c* regex at beginning of text.
fn match_star(c: u8, regex: &[u8], text: &[u8]) -> bool {
    if match_here(regex, text) {
        return true;
    }
    let mut i = 0;
    while i < text.len() && (text[i] == c || c == b'.') {
        if match_here(regex, &text[i..]) {
            return true;
        }
        i  = 1;
    }
    false
}

Question

How can I rewrite these kind of functions to not need so many length checks?
META: Should I use iterators instead of slices as parameters? When choose one over the other?

CodePudding user response：

For the match_here I would use match with the underappreciated slice-pattern syntax. That would avoid most len() uses. Using is_empty() is nicer than len() == 0:

fn match_here(regex: &[u8], text: &[u8]) -> bool {
    match regex {
        &[] => true,
        &[b'$'] => {
            text.is_empty()
        }
        &[z, b'*', ref tail @ ..] => {
            match_star(z, tail, text)
        }
        &[z, ref tail @ ..] if  if !text.is_empty() && (z == b'.' || z == text[0]) => {
            match_here(tail, &text[1..])
        }
        _ => false,
    }
}

Or if you feel fancy you can do a double match:

fn match_here(regex: &[u8], text: &[u8]) -> bool {
    match (regex, text) {
        (&[], _) => true,
        (&[b'$'], &[]) => true,
        (&[b'$'], _) => false,
        (&[z, b'*', ref tail @ ..], txt) => {
            match_star(z, tail, txt)
        }
        (&[z, ref tail @ ..], &[]) => false,
        (&[z, ref tail @ ..], &[tz, ref ttail @ ..]) => if z == b'.' || z == tz => {
            match_here(tail, ttail)
        }
        _ => false,
    }
}

The cool thing about this latter option is that since you are never using the index operator [x] you are sure you will never go out of bounds, without ever checking the len() of your slices.

The len() of the other functions, I don't particularly see them as non-idiomatic. They may be rewritten in a more rusty way, but then the equivalence to the C code would not be so obvious. You would need to stop and think!

About using iterators or slices, I personally prefer slices for this kind of things. The problem with iterators is the backbuffer, for example, to check for the x* you need to take two bytes from the iterator, but then, if the second one is not a * you have to put it back... Naturally you can use Iterator::peek but that will only give you one element. If at any time you need to look ahead two bytes, you have a problem.

So unless you need to parse a very big input (> hundreds of MBs) I would stick to the plain slices.