What is the most efficient way to keep a collection of references to strings in Rust?
Specifically, I have the following as the beginning of some code to parse command line arguments (option parsing to be added):
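use std::env;

let args: Vec<String> = env::args().collect();
let mut files: Vec<&String> = Vec::new();
let mut i = 1; // index 0 is the program name
while i < args.len() {
    let arg = &args[i];
    i += 1; // advance to the next argument
    // Anything that does not start with '-' is treated as a file name.
    // starts_with also handles an empty argument without panicking.
    if !arg.starts_with('-') {
        files.push(arg);
        continue;
    }
}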
args is, as recommended in https://doc.rust-lang.org/book/ch12-01-accepting-command-line-arguments.html, declared as Vec<String>. As I understand it, that means new strings are constructed, which is mildly surprising; I would've expected that the command line arguments already exist in memory, and it would only be necessary to make a vector of references to the existing strings. But the compiler seems to concur that it needs to be Vec<String>.
It would seem inefficient to do the same for files; there is surely no need for further copying. Instead, I have declared it as Vec<&String>, which, as I understand it, means only creating a vector of references to the existing strings, which is optimal. (Not that it makes a measurable performance difference for command line arguments, but I want to figure this out now, so I can get it right later when dealing with much larger data.)
Where I am slightly confused is that Rust seems to frequently recommend str over String, and indeed the compiler is happy to have files hold either str or &str.
My best guess right now is that str, being an object that refers to a slice of a string, is most efficient when you want to keep a reference to just part of the string, but when you know you want the whole string, it is better to skip the overhead of creating a slice object, and just keep &String.
Is the above correct, or am I missing something?
CodePudding user response:
> args is, as recommended in https://doc.rust-lang.org/book/ch12-01-accepting-command-line-arguments.html, declared as Vec<String>. As I understand it, that means new strings are constructed, which is mildly surprising; I would've expected that the command line arguments already exist in memory
The command-line arguments do exist in memory, but:
- they are not String, they are not even guaranteed to be UTF-8
- they are not in a Vec layout
Fundamentally there isn't even any prescription as to their storage. On Unix-like platforms, all you know is that they're C strings (nul-terminated) and that you get an array of pointers to those, whose last element is a null pointer.
Which is why args is an iterator of String: it will lazily decode and validate each argument as you request it. In fact, you can check its source code:
pub fn args() -> Args {
Args { inner: args_os() }
}
#[stable(feature = "env", since = "1.0.0")]
impl Iterator for Args {
type Item = String;
fn next(&mut self) -> Option<String> {
self.inner.next().map(|s| s.into_string().unwrap())
}
fn size_hint(&self) -> (usize, Option<usize>) {
self.inner.size_hint()
}
}
Now I couldn't tell you why args_os yields OsString rather than OsStr; I would assume portability of some sort (e.g. some platforms might not guarantee the args data lives for the entirety of the program).
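Since Args simply unwraps the result of into_string(), it panics on an argument that is not valid Unicode. A minimal sketch of handling that case yourself, assuming you want to keep the undecodable arguments rather than abort (split_valid_utf8 is a hypothetical helper name, not part of std):

```rust
use std::env;
use std::ffi::OsString;

/// Hypothetical helper: split raw arguments into those that are valid
/// Unicode (converted to String) and those that are not (kept as OsString).
fn split_valid_utf8<I>(raw: I) -> (Vec<String>, Vec<OsString>)
where
    I: IntoIterator<Item = OsString>,
{
    let mut valid = Vec::new();
    let mut invalid = Vec::new();
    for arg in raw {
        // into_string() hands back the original OsString on failure.
        match arg.into_string() {
            Ok(s) => valid.push(s),
            Err(os) => invalid.push(os),
        }
    }
    (valid, invalid)
}

fn main() {
    // skip(1) drops the program name.
    let (valid, invalid) = split_valid_utf8(env::args_os().skip(1));
    println!("valid: {:?}", valid);
    println!("not valid Unicode: {:?}", invalid);
}
```

Taking any IntoIterator of OsString rather than calling env::args_os() inside the helper is just a choice made here so the logic can be exercised without real command-line input.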
> My best guess right now is that str, being an object that refers to a slice of a string, is most efficient when you want to keep a reference to just part of the string, but when you know you want the whole string, it is better to skip the overhead of creating a slice object, and just keep &String.
> Is the above correct, or am I missing something?
&String exists only for regularity (in the sense that it's a natural outgrowth of shared references and String existing concurrently); it's not actually useful. An &String only lets you access the read-only / immutable methods of String, all of which are really provided by str, aside from capacity() (which is rarely useful) and a handful of methods duplicated from str to String (I assume for efficiency) like len or is_empty.
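To illustrate why little is lost by preferring &str, here is a small sketch (is_option is a hypothetical helper, not from the question): a function that takes &str accepts an &String unchanged thanks to deref coercion, while the reverse is not true.

```rust
/// Hypothetical helper: takes &str, so it accepts string literals,
/// slices, and &String alike.
fn is_option(arg: &str) -> bool {
    arg.starts_with('-')
}

fn main() {
    let owned: String = String::from("--verbose");
    let by_ref: &String = &owned;

    // &String coerces to &str automatically (deref coercion).
    assert!(is_option(by_ref));
    assert!(is_option(&owned));
    // A literal is already a &str.
    assert!(!is_option("input.txt"));

    // capacity() is one of the few things only reachable through &String.
    assert!(by_ref.capacity() >= by_ref.len());
}
```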
&str is also generally more efficient than &String: while its size is 2 words (pointer, length) rather than one (pointer), it points directly to the relevant data rather than pointing to a pointer to the relevant data (and requiring a dereference to access the length property). As such, &String is rarely considered useful, and clippy will warn against it in function parameters by default (the ptr_arg lint). The same goes for &Vec<T>, where &[T] is usually better for the same reason.