Regex for useless space in form's inputs

I don't understand how to write a regex that removes all of the following patterns (with something like .replace(pattern, "")):

two or more spaces that are not inside text: removed
two or more spaces inside text: reduced to one (ex: " text other " -> "text other")
one or more spaces before or after characters such as the following: removed
1. \n
2. \r\n
3. \t
\r\n: replaced with \n

I tried |\\n |\t \\r\n but obviously this doesn't fully work.

The following assertions can be used to check that it works:

assert_eq!(not_useful_space("   "), "");
assert_eq!(not_useful_space("    a l    l   lower      "), "a l l lower");
assert_eq!(not_useful_space("    i need\n new lines\n\n many times     "), "i need\nnew lines\n\nmany times");
assert_eq!(not_useful_space("    i need  \n new lines \n\n many times     "), "i need\nnew lines\n\nmany times");
assert_eq!(not_useful_space("  i need \r\n new lines\r\nmany times   "), "i need\nnew lines\nmany times");
assert_eq!(not_useful_space("    i need \t new lines\t \t many times     "), "i need new lines many times");
assert_eq!(not_useful_space("  à   la  "), "à la");

CodePudding user response:

You can do this in a single regex with the MULTILINE flag enabled:

(?m)[ \t]*\r[ \t]*|^[ \t]+|[ \t]+$|([ \t]){2,}

Replace each match with $1.

Rust Code:

use once_cell::sync::Lazy;
use regex::Regex;

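pub fn magic(input: &str) -> String {
    // Compile the pattern once; (?m) makes ^ and $ match at line boundaries.
    static REGEX: Lazy<Regex> = Lazy::new(|| {
        Regex::new(r"(?m)[ \t]*\r[ \t]*|^[ \t]+|[ \t]+$|([ \t]){2,}").unwrap()
    });

    // $1 is empty unless the last alternative matched, so every other match is deleted.
    REGEX.replace_all(input, "$1").to_string()
}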

#[cfg(test)]
fn magic_data() -> std::collections::HashMap<&'static str, &'static str> {
    std::collections::HashMap::from([
        ("   ", ""),
        ("    a l    l   lower      ", "a l l lower"),
        (
            "    i need\n new lines\n\n many times     ",
            "i need\nnew lines\n\nmany times",
        ),
        (
            "    i need  \n new lines \n\n many times     ",
            "i need\nnew lines\n\nmany times",
        ),
        (
            "  i need \r\n new lines\r\nmany times   ",
            "i need\nnew lines\nmany times",
        ),
        (
            "    i need \t new lines\t \t many times     ",
            "i need new lines many times",
        ),
        ("  à   la  ", "à la"),
    ])
}

#[test]
fn test() {
    for (k, v) in magic_data() {
        assert_eq!(magic(k), v)
    }
}

Javascript Demo:

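// Log whether the two values are strictly equal.
function assert_eq(lhs, rhs) {
  console.log(lhs === rhs);
}

function not_useful_space(str) {
  // With the m flag, $ also matches before \r, so a bare \r alternative is enough here.
  return str.replace(/^[ \t]+|[ \t]+$|\r|([ \t]){2,}/mg, '$1');
}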

assert_eq(not_useful_space("   "), "");
assert_eq(not_useful_space("    a l    l   lower      "), "a l l lower");
assert_eq(not_useful_space("    i need\n new lines\n\n many times     "), "i need\nnew lines\n\nmany times");
assert_eq(not_useful_space("    i need  \n new lines \n\n many times     "), "i need\nnew lines\n\nmany times");
assert_eq(not_useful_space("  i need \r\n new lines\r\nmany times   "), "i need\nnew lines\nmany times");
assert_eq(not_useful_space("    i need \t new lines\t \t many times     "), "i need new lines many times");
assert_eq(not_useful_space("  à   la  "), "à la");

RegEx Breakup:

[ \t]*\r[ \t]*: match \r with optional spaces or tabs on both sides
|: OR
^: start of line
[ \t]+: match 1 or more space or tab characters
|: OR
[ \t]+: match 1 or more space or tab characters
$: end of line
|: OR
([ \t]){2,}: match 2 or more space or tab characters, keeping the last one in capture group 1
$1: the replacement; it puts a single space/tab character back when the last alternative matched and is empty for every other alternative

CodePudding user response:

If you're interested, here's a non-regex version:

fn not_useful_space(text: &str) -> String {
    text.lines()
        .map(|line| {
            line.trim()
                .split_ascii_whitespace()
                .collect::<Vec<_>>()
                .join(" ")
        })
        .collect::<Vec<_>>()
        .join("\n")
}
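
To check the non-regex version against the cases from the question, here is a minimal self-contained sketch. The function is repeated as posted above; question_cases is a hypothetical helper name introduced only for this example, and the two extra edge cases at the end are my own additions, not part of the question. It uses only the standard library.

```rust
// Non-regex version from the answer, repeated so this sketch compiles on its own.
fn not_useful_space(text: &str) -> String {
    text.lines()
        .map(|line| {
            line.trim()
                .split_ascii_whitespace()
                .collect::<Vec<_>>()
                .join(" ")
        })
        .collect::<Vec<_>>()
        .join("\n")
}

// Hypothetical helper: the question's assertions as (input, expected) pairs.
fn question_cases() -> Vec<(&'static str, &'static str)> {
    vec![
        ("   ", ""),
        ("    a l    l   lower      ", "a l l lower"),
        (
            "    i need\n new lines\n\n many times     ",
            "i need\nnew lines\n\nmany times",
        ),
        (
            "    i need  \n new lines \n\n many times     ",
            "i need\nnew lines\n\nmany times",
        ),
        (
            "  i need \r\n new lines\r\nmany times   ",
            "i need\nnew lines\nmany times",
        ),
        (
            "    i need \t new lines\t \t many times     ",
            "i need new lines many times",
        ),
        ("  à   la  ", "à la"),
    ]
}

fn main() {
    for (input, expected) in question_cases() {
        assert_eq!(not_useful_space(input), expected);
    }

    // Edge case (not in the question): str::lines drops a trailing newline.
    assert_eq!(not_useful_space("a\n"), "a");

    // Edge case (not in the question): a no-break space (U+00A0) inside a line
    // is not ASCII whitespace, so it is neither split on nor collapsed.
    assert_eq!(not_useful_space("a\u{a0}\u{a0}b"), "a\u{a0}\u{a0}b");

    println!("all cases pass");
}
```

One difference from the regex answer is worth knowing: this version always joins words with a plain space, while the regex puts back whichever character ended the run, which may be a tab. If a trailing newline in the input has to survive, it would need to be re-appended after the join.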
