I'm dealing with strings where I need to replace multiple spaces with just a single space. It looks like most of these are just human error, but I'm curious about the ideal way to handle this, preferably with the fewest allocations going from &str to String.
So far this is my approach below:
const SPACE: &str = " ";
const TWO_SPACES: &str = "  ";

/// Replace multiple spaces with a single space
pub fn trim_whitespace(s: &str) -> String {
    let mut new_str: String = s.trim().to_owned();
    while new_str.contains(TWO_SPACES) {
        new_str = new_str.replace(TWO_SPACES, SPACE);
    }
    new_str
}

let result = trim_whitespace("Hello     world!  ");
assert_eq!(result, "Hello world!");
CodePudding user response:
split_whitespace() is very convenient for this usage.
A vector and a string are allocated in the very simple first solution.
The second solution allocates only a string, but is a bit inelegant (an if at each iteration).
pub fn trim_whitespace_v1(s: &str) -> String {
    // first attempt: allocates a vector and a string
    let words: Vec<_> = s.split_whitespace().collect();
    words.join(" ")
}

pub fn trim_whitespace_v2(s: &str) -> String {
    // second attempt: only allocate a string
    let mut result = String::with_capacity(s.len());
    s.split_whitespace().for_each(|w| {
        if !result.is_empty() {
            result.push(' ');
        }
        result.push_str(w);
    });
    result
}
fn main() {
    let source = "  a  bb cc   ddd ";
    println!("{:?}", trim_whitespace_v1(source)); // "a bb cc ddd"
    println!("{:?}", trim_whitespace_v2(source)); // "a bb cc ddd"
}
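If the `if` at each iteration is a concern, one possible variation is to push the first word before the loop so that the loop body needs no check. This is only a sketch; the name `trim_whitespace_v3` is hypothetical and is not one of the versions above.

```rust
/// Hypothetical variant of `trim_whitespace_v2`: the first word is pushed
/// before the loop, so the loop body needs no emptiness check.
pub fn trim_whitespace_v3(s: &str) -> String {
    // One allocation, sized for the worst case (nothing to remove).
    let mut result = String::with_capacity(s.len());
    let mut words = s.split_whitespace();
    if let Some(first) = words.next() {
        result.push_str(first);
        // Every remaining word is preceded by exactly one space.
        for w in words {
            result.push(' ');
            result.push_str(w);
        }
    }
    result
}
```

It still allocates a single string; whether it is measurably faster than the second solution would need to be benchmarked.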
CodePudding user response:
You can use split(' '), filter out empty entries, then re-join by space:
s.trim()
    .split(' ')
    .filter(|s| !s.is_empty())
    .collect::<Vec<_>>()
    .join(" ")

// Or, using itertools:
use itertools::Itertools;
s.trim().split(' ').filter(|s| !s.is_empty()).join(" ")
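One behavioral difference from split_whitespace() is worth keeping in mind: split(' ') only treats the ASCII space as a separator, while split_whitespace() splits on any Unicode whitespace, including tabs and newlines. The sketch below puts the two side by side; the function names `collapse_spaces` and `collapse_whitespace` are made up for illustration.

```rust
/// Collapses runs of the ASCII space character only.
pub fn collapse_spaces(s: &str) -> String {
    s.split(' ')
        .filter(|part| !part.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Collapses runs of any Unicode whitespace (spaces, tabs, newlines, ...).
pub fn collapse_whitespace(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

For an input such as "a\t\tb  c", the first function leaves the tabs in place and only collapses the double space, while the second replaces the tabs with a single space as well. Which one is right depends on whether tabs and newlines should count as "spaces" for your data.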
Another possibility is to use String::retain() and remove consecutive spaces. It should also be faster since it allocates only once, for the trimmed string:
pub fn trim_whitespace(s: &str) -> String {
    let mut new_str = s.trim().to_owned();
    let mut prev = ' '; // The initial value doesn't really matter
    new_str.retain(|ch| {
        let result = ch != ' ' || prev != ' ';
        prev = ch;
        result
    });
    new_str
}
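If most inputs are already clean, the allocation can sometimes be skipped entirely by returning a `Cow<str>` instead of a `String`. The following is a sketch under that assumption; `trim_whitespace_cow` is a hypothetical name, it is not one of the versions suggested in this thread, and it has not been benchmarked.

```rust
use std::borrow::Cow;

/// Hypothetical helper: borrows the input when it is already normalized
/// and only allocates when something has to be removed.
pub fn trim_whitespace_cow(s: &str) -> Cow<'_, str> {
    let trimmed = s.trim();
    if !trimmed.contains("  ") {
        // No run of two spaces: the trimmed slice is already the answer.
        return Cow::Borrowed(trimmed);
    }
    // Same retain-based approach as above, applied only when needed.
    let mut owned = trimmed.to_owned();
    let mut prev = ' ';
    owned.retain(|ch| {
        let keep = ch != ' ' || prev != ' ';
        prev = ch;
        keep
    });
    Cow::Owned(owned)
}
```

The trade-off is an extra scan of the string (the `contains` check) in exchange for no allocation in the clean case, and callers have to deal with a `Cow` rather than a plain `String`.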
Edit: I was curious, so I benchmarked all versions suggested here with the string "  a  bb cc   ddd " (of course, different strings will have different performance characteristics). Benchmark code is here (requires criterion and itertools).
Results:
benches/trim_whitespace_replace
time: [846.02 ns 872.71 ns 901.90 ns]
benches/trim_whitespace_retain
time: [146.79 ns 153.07 ns 159.91 ns]
benches/trim_whitespace_split_space
time: [268.61 ns 277.44 ns 287.55 ns]
benches/trim_whitespace_split_space_itertools
time: [392.82 ns 406.92 ns 423.88 ns]
benches/trim_whitespace_split_whitespace
time: [236.38 ns 244.51 ns 254.00 ns]
benches/trim_whitespace_split_whitespace_itertools
time: [395.82 ns 413.59 ns 433.26 ns]
benches/trim_whitespace_split_whitespace_only_one_string
time: [146.25 ns 152.73 ns 159.94 ns]
As expected, your version using replace() is the slowest. My version using retain() and @prog-fh's faster version are the fastest. (I expected his version to be faster because it needs to copy less, but apparently the difference is very small, and modern CPUs copy small blocks of memory very fast. Maybe this will show up with larger strings.) Somewhat surprisingly, the versions using itertools' join() are slower than the versions that use only the standard library's collect() then join(), despite not needing to collect into a vector first. I can explain it, though: these versions use dynamic dispatch over Display that the compiler may not be able to eliminate (I'm not sure; I'd need to check the assembly to verify), and worse, they may actually need to allocate more, because they don't know the amount of space required ahead of time and they also need to insert the separator.
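The point about knowing the required space ahead of time can be checked directly. The collapsed output is never longer than the input, so a buffer created with `String::with_capacity(s.len())` should never need to grow. The sketch below records the capacity before and after building the result; `build_with_capacity_report` is a hypothetical helper written only for this illustration.

```rust
/// Builds the result as in the single-`String` version and reports the
/// buffer capacity before and after, to show that it never grows.
pub fn build_with_capacity_report(s: &str) -> (String, usize, usize) {
    // Worst case: nothing is removed, so s.len() bytes are always enough.
    let mut result = String::with_capacity(s.len());
    let before = result.capacity();
    for w in s.split_whitespace() {
        if !result.is_empty() {
            result.push(' ');
        }
        result.push_str(w);
    }
    let after = result.capacity();
    (result, before, after)
}
```

If the two capacities are equal, the buffer was allocated once and never reallocated, which is the guarantee a generic iterator-based join cannot rely on without knowing the total length up front.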