Home > Software design >  Using regex in Rust to make a map from a file
Using regex in Rust to make a map from a file

Time:11-21

I am pulling my hair out (and I don't have much left) trying to figure out how to use Rust.

Here is what I am trying to do

    let mut map = HashMap::new();

    let input = File::open(filename).unwrap();
    let reader = BufReader::new(input);

    let re = Regex::new(r"(. )\)(. )").unwrap();

    for line in reader.lines() {
        if let Some(captures) = re.captures(&line.unwrap()) {
            let key = captures.get(1).map_or("", |m| m.as_str());
            let value = captures.get(2).map_or("", |m| m.as_str());

            println!("{} {}", key, value);

            map.entry(key).or_insert(Vec::new()).push(value);
        }
    }
    println!("{:?}", map);

I get this compiler error:

error[E0716]: temporary value dropped while borrowed
  --> src/main.rs:22:46
   |
22 |         if let Some(captures) = re.captures(&line.unwrap()) {
   |                                              ^^^^^^^^^^^^^ creates a temporary which is freed while still in use
...
30 |     }
   |     - temporary value is freed at the end of this statement
31 |     println!("{:?}", map);
   |                      --- borrow later used here
   |
   = note: consider using a `let` binding to create a longer lived value

It is already looking really ugly and the compiler is unhappy. Any suggestions on how to do this in a more "rusty" way that will work?

CodePudding user response:

I know how you feel... seemingly simple things like this can get complicated in Rust quickly. It's definitely not a succinct language, and that verbosity is tough for me to deal with as well, but it forces you "go through the motions" in order to end up with a provably-correct application.

To break this down, let's start with line

  • The reader is reading the file line by line, with the happy path creating a String
  • for line in takes ownership of said String in the line variable
  • line falls out of scope at the end of each loop
// line is "owned by" this variable
for line in reader.lines() {
    // so no matter what is in here
} // `line` gets delete from memory here

Next up, all of the "to_string stuff". Much of Rust's speed and memory efficiency is based on "zero cost abstractions", a.k.a "avoiding lots of copies of things". This kind of operation can be really expensive in other languages. In addition to the actual work of checking the pattern against the string, if it matches, in total you would have:

  • a String for the whole line
  • another String for matching part (capture[0])
  • 2 more String for each of the capture groups

That's triple the amount of "stuff" in memory, it's not free to allocate it for each loop, and the CPU had to do the work of copying it.

Instead, for many of these types of operations where you see &str, it's a "slice". Instead of a new copy, it's basically just some indexes. E.g. if you have a string such as:

this is a test

A slice for is a would actually be something like:

{
  referenceToString: <some_memory_address>,
  start: 5, // number of characters to the beginning
  length: 4,
}

All of the things in the captures result are based on these. Remember that line is removed from memory after the loop. Any of those slices referring to <some_memory_address> would no longer be valid, so you can't put them into your HashMap. You need to call to_string to copy the actual text into a new value. This is the cause of the majority of these kinds of errors. In c , you could hold on to such a construct, but the stuff at that memory location would be replaced with something else. Your map would have got_knows_what in both keys and values, and they would constantly change as new stuff was written to memory.


Next note that line is a Result<String, Error>. Any number of things could go wrong in the middle of reading the file. Thus, line is either the next line as a String, or an Error that you need to handle. It's like an "inverted try/catch", if you will. You can "kick the can" with a trailing ?, but then the current function will have to return a Result<Whatever, Error>, making it the caller's problem.

Using unwrap is always a "deal with the devil". If you're truly, absolutely, 100% sure that there can't really be an error there, you can use it, but there could always be an error during I/O, so it's never a safe assumption there. unwrap bypasses the error handling mechanism of the language, so if you unwrap an error, the program crashes, effectively saying "I tried to warn ya..."


Finally, as @Pitaj mentioned, reader.lines creates a new String for each line. There's some overhead in it having to find empty memory over and over, so you can be more efficient -- particularly for larger files -- with reader.read_line.


So first, let's make your mess a lot worse, with comments inline

// notice we return a `Result`, because we're not going to deal with errors here
fn get_matches(filename: &str) -> io::Result<HashMap<String, Vec<String>>> {
    let re = Regex::new(r"(. )\)(. )").unwrap();
    let mut map = HashMap::new();

    // `?` to "kick the can" on an error opening the file
    let file = File::open(filename)?;
    // the reader is `mut` because... well... because the compiler
    // told me it should be. ;-)
    let mut reader = BufReader::new(file);
    // the memory buffer we're going to reuse for each loop
    let mut buffer = String::new();

    // `read_line` returns the size read, which is 0 when we're finished
    // `?` to "kick the can" on an error reading the line
    while reader.read_line(&mut buffer)? > 0 {
        // if the regex matches
        if let Some(captures) = re.captures(&buffer) {
            // convenience closure to unpack
            let capture = |index| {
                // the capture group will always be a `Some` here, so you *could*
                // use `unwrap` here. But it would not be if your regex used a
                // `?` (optional) capture, so safer to "go through the motions"
                if let Some(m) = captures.get(index) {
                    // the match is a `Match` instance, so convert
                    // to `&str` with `as_str`
                    let str_slice = m.as_str();
                    // we can't keep a `&str` after the loop, so
                    // create a new `String` with its contents
                    let string = str_slice.to_string();
                    // the result is an `Option`
                    return Some(string)
                }
                // if the match were `None`, this `capture` function
                // wouldn't have a `String` to return
                None
            };

            // if both captures exist, `k` and `v` are `String`
            if let (Some(k), Some(v)) = (capture(1), capture(2)) {
                // and now `map` owns both of these `String` values
                map.entry(k).or_insert(Vec::new()).push(v);
            }
        }
        // empty out the buffer, but keep the memory it allocated
        buffer.clear();
    }
    Ok(map)
}

After all of that, this doesn't seem so bad. ;-)

fn get_matches(filename: &str) -> io::Result<HashMap<String, Vec<String>>> {
    let re = Regex::new(r"(. )\)(. )").unwrap();
    let mut map = HashMap::new();

    let file = File::open(filename)?;
    let mut reader = BufReader::new(file);
    let mut buffer = String::new();

    while reader.read_line(&mut buffer)? > 0 {
        if let Some(captures) = re.captures(&buffer) {
            let capture = |index| {
                captures.get(index).map(|c| c.as_str().to_string())
            };

            if let (Some(k), Some(v)) = (capture(1), capture(2)) {
                map.entry(k).or_insert(Vec::new()).push(v);
            }
        }
        buffer.clear();
    }
    Ok(map)
}
  • Related