What's the ideal way to trim extra spaces from a string?

Time:04-14

I'm dealing with strings where I need to replace multiple spaces with a single space. Most of these look like human error, but I'm curious about the ideal way to handle this, preferably with the fewest allocations when going from &str to String.

So far this is my approach below:

const SPACE: &str = " ";
const TWO_SPACES: &str = "  ";

/// Replace multiple spaces with a single space
pub fn trim_whitespace(s: &str) -> String {
    let mut new_str: String = s.trim().to_owned();
    while new_str.contains(TWO_SPACES) {
        new_str = new_str.replace(TWO_SPACES, SPACE);
    }
    new_str
}

let result = trim_whitespace("Hello     world! ");
assert_eq!(result, "Hello world!");

CodePudding user response:

split_whitespace() is very convenient for this usage.

A vector and a string are allocated in the very simple first solution.

The second solution allocates only a string, but is a bit inelegant (an if at each iteration).

pub fn trim_whitespace_v1(s: &str) -> String {
    // first attempt: allocates a vector and a string
    let words: Vec<_> = s.split_whitespace().collect();
    words.join(" ")
}

pub fn trim_whitespace_v2(s: &str) -> String {
    // second attempt: only allocate a string
    let mut result = String::with_capacity(s.len());
    s.split_whitespace().for_each(|w| {
        if !result.is_empty() {
            result.push(' ');
        }
        result.push_str(w);
    });
    result
}

fn main() {
    let source = "  a   bb cc   ddd    ";
    println!("{:?}", trim_whitespace_v1(source)); // "a bb cc ddd"
    println!("{:?}", trim_whitespace_v2(source)); // "a bb cc ddd"
}

CodePudding user response:

You can use split(' '), filter out the empty entries, then re-join with a space:

s.trim()
    .split(' ')
    .filter(|s| !s.is_empty())
    .collect::<Vec<_>>()
    .join(" ")

// Or, using itertools:
use itertools::Itertools;
s.trim().split(' ').filter(|s| !s.is_empty()).join(" ")
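
One behavioral difference is worth keeping in mind when choosing between the two splitting approaches: split(' ') treats only the space character as a separator, while split_whitespace() splits on any Unicode whitespace, so tabs and newlines are handled differently. A minimal sketch of the difference, using only the standard library (the function names are hypothetical and not from the answers above):

```rust
/// Collapses runs of any Unicode whitespace into one space,
/// because `split_whitespace` treats tabs and newlines as separators too.
pub fn collapse_any_whitespace(s: &str) -> String {
    let mut result = String::with_capacity(s.len());
    for word in s.split_whitespace() {
        if !result.is_empty() {
            result.push(' ');
        }
        result.push_str(word);
    }
    result
}

/// Collapses runs of the space character only;
/// tabs and newlines stay attached to the neighbouring words.
pub fn collapse_spaces_only(s: &str) -> String {
    let mut result = String::with_capacity(s.len());
    for word in s.split(' ').filter(|w| !w.is_empty()) {
        if !result.is_empty() {
            result.push(' ');
        }
        result.push_str(word);
    }
    result
}
```

For the input "a\t b\n  c", the first function returns "a b c", while the second returns "a\t b\n c". Which one is right depends on whether the input is expected to contain whitespace other than plain spaces.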

Another possibility is to use String::retain() and remove consecutive spaces. It should also be faster since it allocates only once, for the trimmed string:

pub fn trim_whitespace(s: &str) -> String {
    let mut new_str = s.trim().to_owned();
    let mut prev = ' '; // The initial value doesn't really matter
    new_str.retain(|ch| {
        let result = ch != ' ' || prev != ' ';
        prev = ch;
        result
    });
    new_str
}
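
If many inputs are already clean, the one remaining allocation can sometimes be skipped as well. The following is only a sketch, assuming the caller can work with a Cow<str> instead of a String. The name trim_whitespace_cow is hypothetical, and this variant was not part of the benchmark.

```rust
use std::borrow::Cow;

/// Hypothetical helper: borrows the trimmed input when it contains no
/// doubled spaces, and allocates only when something has to be removed.
pub fn trim_whitespace_cow(s: &str) -> Cow<'_, str> {
    let trimmed = s.trim();
    if !trimmed.contains("  ") {
        // Nothing to collapse: hand back a slice of the original string.
        return Cow::Borrowed(trimmed);
    }
    // Same retain() approach as above, applied only when needed.
    let mut owned = trimmed.to_owned();
    let mut prev = ' '; // `trimmed` never starts with a space, so this value is not significant
    owned.retain(|ch| {
        let keep = ch != ' ' || prev != ' ';
        prev = ch;
        keep
    });
    Cow::Owned(owned)
}
```

The trade-off is an extra scan of the input in contains() before any work is done, so this is likely to pay off only when clean inputs are common.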

Edit: I was curious, so I benchmarked all the versions suggested here with the string " a bb cc ddd " (of course, different strings will have different performance characteristics). Benchmark code is here (requires criterion and itertools).

Results:

benches/trim_whitespace_replace
                        time:   [846.02 ns 872.71 ns 901.90 ns]
benches/trim_whitespace_retain
                        time:   [146.79 ns 153.07 ns 159.91 ns]
benches/trim_whitespace_split_space
                        time:   [268.61 ns 277.44 ns 287.55 ns]
benches/trim_whitespace_split_space_itertools
                        time:   [392.82 ns 406.92 ns 423.88 ns]
benches/trim_whitespace_split_whitespace
                        time:   [236.38 ns 244.51 ns 254.00 ns]
benches/trim_whitespace_split_whitespace_itertools
                        time:   [395.82 ns 413.59 ns 433.26 ns]
benches/trim_whitespace_split_whitespace_only_one_string
                        time:   [146.25 ns 152.73 ns 159.94 ns]

As expected, your version using replace() is the slowest. My version using retain() and @prog-fh's faster version are the fastest. (I expected his version to be faster because it needs to copy less, but apparently the difference is very small, and modern CPUs copy small blocks of memory very fast. Maybe this will show up with larger strings.) Somewhat surprisingly, the versions using itertools' join() are slower than the versions that use only the standard library's collect() then join(), despite not needing to collect into a vector first. I can explain it, though: these versions use dynamic dispatch over Display, which the compiler may not be able to eliminate (I'm not sure; I'd need to check the assembly to verify), and worse, they may actually need to allocate more because they don't know the amount of space required ahead of time and they also need to insert the separator.
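
To make the explanation above more concrete, here is a rough sketch of what a generic, Display-based join over an iterator has to do. It is a hypothetical stand-in written for illustration, not itertools' actual implementation, and join_display is an invented name.

```rust
use std::fmt::{Display, Write};

/// Hypothetical stand-in for a generic join that only knows its items
/// implement `Display` (not itertools' real code).
pub fn join_display<I>(iter: I, sep: &str) -> String
where
    I: IntoIterator,
    I::Item: Display,
{
    // The final length is unknown up front, so the buffer may have to grow
    // (and reallocate) several times while items are appended.
    let mut result = String::new();
    for (i, item) in iter.into_iter().enumerate() {
        if i > 0 {
            result.push_str(sep);
        }
        // Each item goes through the formatting machinery instead of a plain `push_str`.
        write!(result, "{}", item).expect("formatting into a String failed");
    }
    result
}
```

By contrast, calling join() on a slice of &str can add up the lengths of all the pieces first and allocate the result once, which fits the numbers above.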
