In the book The Practice of Programming, there's a short snippet of C code that implements a regex matcher. Brian W Kernighan has expanded on that chapter and published it online at https://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html
Robs C code relies on the '\0' sentinels for length matching, but that won't work in Rust. So my port of the code ends up with lots of length checks.
I use slices to byte arrays to simplify the task a bit, instead of supporting utf-8 encoded strings.
The implementation consists of three functions re_match
, match_here
and match_star
. I use ranges for iterating over the slices.
// Search for regex anywhere in the text.
fn re_match(regex: &[u8], text: &[u8]) -> bool {
if regex.len() > 0 && regex[0] == b'^' {
return match_here(®ex[1..], &text);
}
// We need to check even if text is empty
if match_here(regex, text) {
return true
}
for i in 0..text.len() {
if match_here(regex, &text[i..]) {
return true
}
}
false
}
// Search for regex at beginning of text
fn match_here(regex: &[u8], text: &[u8]) -> bool {
if regex.len() == 0 {
return true;
}
if regex.len() > 1 && regex[1] == b'*' {
return match_star(regex[0], ®ex[2..], text);
}
if regex.len() == 1 && regex[0] == b'$' {
return text.len() == 0;
}
if text.len() > 0 && (regex[0] == b'.' || regex[0] == text[0]) {
return match_here(®ex[1..], &text[1..]);
}
false
}
// Search for c* regex at beginning of text.
fn match_star(c: u8, regex: &[u8], text: &[u8]) -> bool {
if match_here(regex, text) {
return true;
}
let mut i = 0;
while i < text.len() && (text[i] == c || c == b'.') {
if match_here(regex, &text[i..]) {
return true;
}
i = 1;
}
false
}
Question
- How can I rewrite these kind of functions to not need so many length checks?
- META: Should I use iterators instead of slices as parameters? When choose one over the other?
CodePudding user response:
For the match_here
I would use match
with the underappreciated slice-pattern syntax. That would avoid most len()
uses. Using is_empty()
is nicer than len() == 0
:
fn match_here(regex: &[u8], text: &[u8]) -> bool {
match regex {
&[] => true,
&[b'$'] => {
text.is_empty()
}
&[z, b'*', ref tail @ ..] => {
match_star(z, tail, text)
}
&[z, ref tail @ ..] if if !text.is_empty() && (z == b'.' || z == text[0]) => {
match_here(tail, &text[1..])
}
_ => false,
}
}
Or if you feel fancy you can do a double match:
fn match_here(regex: &[u8], text: &[u8]) -> bool {
match (regex, text) {
(&[], _) => true,
(&[b'$'], &[]) => true,
(&[b'$'], _) => false,
(&[z, b'*', ref tail @ ..], txt) => {
match_star(z, tail, txt)
}
(&[z, ref tail @ ..], &[]) => false,
(&[z, ref tail @ ..], &[tz, ref ttail @ ..]) => if z == b'.' || z == tz => {
match_here(tail, ttail)
}
_ => false,
}
}
The cool thing about this latter option is that since you are never using the index operator [x]
you are sure you will never go out of bounds, without ever checking the len()
of your slices.
The len()
of the other functions, I don't particularly see them as non-idiomatic. They may be rewritten in a more rusty way, but then the equivalence to the C code would not be so obvious. You would need to stop and think!
About using iterators or slices, I personally prefer slices for this kind of things. The problem with iterators is the backbuffer, for example, to check for the x*
you need to take two bytes from the iterator, but then, if the second one is not a *
you have to put it back... Naturally you can use Iterator::peek
but that will only give you one element. If at any time you need to look ahead two bytes, you have a problem.
So unless you need to parse a very big input (> hundreds of MBs) I would stick to the plain slices.