Home > Enterprise >  Most efficient way to keep collection of string references
Most efficient way to keep collection of string references

Time:12-21

What is the most efficient way to keep a collection of references to strings in Rust?

Specifically, I have the following as the beginning of some code to parse command line arguments (option parsing to be added):

let args: Vec<String> = env::args().collect();
let mut files: Vec<&String> = Vec::new();
let mut i = 1;
while i < args.len() {
    let arg = &args[i];
    i  = 1;
    if arg.as_bytes()[0] != b'-' {
        files.push(arg);
        continue;
    }
}

args is as recommended in https://doc.rust-lang.org/book/ch12-01-accepting-command-line-arguments.html declared as Vec<String>. As I understand it, that means new strings are constructed, which is mildly surprising; I would've expected that the command line arguments already exist in memory, and it would only be necessary to make a vector of references to the existing strings. But the compiler seems to concur that it needs to be Vec<String>.

It would seem inefficient to do the same for files; there is surely no need for further copying. Instead, I have declared it as Vec<&String>, which as I understand it, means only creating a vector of references to the existing strings, which is optimal. (Not that it makes a measurable performance difference for command line arguments, but I want to figure this out now, so I can get it right later when dealing with much larger data.)

Where I am slightly confused is that Rust seems to frequently recommend str over String, and indeed the compiler is happy to have files hold either str or &str.

My best guess right now is that str, being an object that refers to a slice of a string, is most efficient when you want to keep a reference to just part of the string, but when you know you want the whole string, it is better to skip the overhead of creating a slice object, and just keep &String.

Is the above correct, or am I missing something?

CodePudding user response:

args is as recommended in https://doc.rust-lang.org/book/ch12-01-accepting-command-line-arguments.html declared as Vec<String>. As I understand it, that means new strings are constructed, which is mildly surprising; I would've expected that the command line arguments already exist in memory

The command-line arguments do exist in memory but

  • they are not String, they are not even guaranteed to be UTF8
  • they are not in a Vec layout

Fundamentally there isn't even any prescription as to their storage, all you know is they're C strings (nul-terminated) and you get an array of pointers to those, whose last element is a null pointer.

Which is why args is an iterator of String: it will lazily decode and validate each argument as you request it, in fact you can check its source code:

pub fn args() -> Args {
    Args { inner: args_os() }
}
#[stable(feature = "env", since = "1.0.0")]
impl Iterator for Args {
    type Item = String;
    fn next(&mut self) -> Option<String> {
        self.inner.next().map(|s| s.into_string().unwrap())
    }
    fn size_hint(&self) -> (usize, Option<usize>) {
        self.inner.size_hint()
    }
}

Now I couldn't tell you why args_os yields OsString rather than OsStr, I would assume portability of some sort (e.g. some platforms might not guarantee the args data lives for the entirety of the program).

My best guess right now is that str, being an object that refers to a slice of a string, is most efficient when you want to keep a reference to just part of the string, but when you know you want the whole string, it is better to skip the overhead of creating a slice object, and just keep &String.

Is the above correct, or am I missing something?

&String exists only for regularity (in the sense that it's a natural outgrowth of shared references and String existing concurrently), it's not actually useful: an &String only lets you access readonly / immutable methods of String, all of which are really provided by str aside from capacity() (which is rarely useful) and a handful of methods duplicated from str to String (I assume for efficiency) like len or is_empty.

&str is also generally more efficient than &String: while its size is 2 words (pointer, length) rather than one (pointer), it points directly to the relevant data rather than pointing to a pointer to the relevant data (and requiring a dereference to access the length property). As such, &String is rarely considered useful and clippy will warn against it by default (also &Vec as &[] is usually better for the same reason).

  • Related