Bash: how to selectively remove spaces in a string

Is there a way to selectively remove spaces in a string, in bash? e.g.

hello world你好 世界！
hello world你 好 世 界！
hello world 你 好 世 界！
你 好 世 界 hello world

and output:

hello world你好世界！
你好世界hello world

Note that I want to preserve spaces between English words (that is, between letters of the English alphabet), but remove all other spaces.

I understand that Python's re module is probably good for this, but I would prefer a bash command if possible.

CodePudding user response：

You can use sed:

echo hello world你好 世界！ | sed -E "s/([^a-zA-Z]) ([^a-zA-Z])/\1\2/g"

([^a-zA-Z]) ([^a-zA-Z]) is a regular expression matching a space between two characters that are not ASCII Latin letters (inside the brackets, ^ negates the set). The preceding and following characters are captured in groups 1 and 2.
\1\2 is the replacement string: the two captured groups, without the space between them.

Output:

hello world你好世界！

Note: to also remove leading and trailing spaces, the expression should be:

(^|[^a-zA-Z]) ([^a-zA-Z]|$)
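
For illustration, here is a minimal sketch of that anchored expression dropped into the same sed call. The function name trim_outer_spaces is made up for this example, and the sample string was picked so that no two matches overlap.

```shell
# Hypothetical helper: the sed command from above with the anchored
# expression, reading lines from stdin.
trim_outer_spaces() {
  sed -E 's/(^|[^a-zA-Z]) ([^a-zA-Z]|$)/\1\2/g'
}

# A leading, an inner and a trailing space, each next to non-letters only.
printf '%s\n' ' 你好 世界 ' | trim_outer_spaces
```

A space that sits directly next to a Latin letter is still kept, because each side of the match has to be either a non-letter or a line boundary.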

Edit: One thing I didn't take into account is that this kind of expression consumes the characters before and after the space, so a character matched after one space cannot also be matched before the next one. In a case like 你 好 世 界 hello world, some spaces were therefore left in place. You then have to use a regex engine that supports lookarounds:

echo " 你 好 世 界 hello world, !"  | perl -pe "s/(?<=^|[^[:ascii:]]) | (?=[^[:ascii:]]|$)//g"

Output:

你好世界hello world, !

To also remove the spaces between Latin characters and kanji, I split the expression into two alternatives. I also replaced the Latin-letter condition with an ASCII one, which should give more appropriate matches.
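
If Perl is not available, the same idea of splitting the expression in two can be sketched in plain sed: one substitution looks only at the character before the spaces, the other only at the character after, so neither consumes both neighbours and overlapping matches stop being a problem. This is a sketch under assumptions, not a drop-in equivalent: the name strip_spaces is made up, it uses the [^a-zA-Z] class from the first attempt rather than [:ascii:] (so spaces next to ASCII digits or punctuation are removed as well), and it handles plain spaces only, not tabs.

```shell
# Hypothetical helper: reads lines from stdin and removes every run of spaces
# that touches a character other than an ASCII letter.
# The four expressions, in order:
#   1. drop leading spaces
#   2. drop trailing spaces
#   3. drop spaces that follow a non-letter (only that character is captured)
#   4. drop spaces that precede a non-letter (only that character is captured)
strip_spaces() {
  sed -E \
    -e 's/^ +//' \
    -e 's/ +$//' \
    -e 's/([^a-zA-Z]) +/\1/g' \
    -e 's/ +([^a-zA-Z])/\1/g'
}

printf '%s\n' 'hello world 你 好 世 界！' '你 好 世 界 hello world' | strip_spaces
```

For these two sample lines from the question it prints hello world你好世界！ and 你好世界hello world.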

CodePudding user response：

A Perl solution using Unicode properties (in particular, whether or not a character is in the Latin script):

$ perl -CSD -lpe 's/^\s+//; # Remove leading spaces
                  s/\s+$//; # Remove trailing spaces
                  # Remove spaces between two non-latin characters.
                  s/(\P{scx=Latin})\s+(?=\P{scx=Latin})/$1/g;
                  # Remove spaces between a leading latin and trailing non-latin
                  s/(\p{scx=Latin})\s+(?=\P{scx=Latin})/$1/g;
                  # Remove spaces between a leading non-latin and trailing latin
                  s/(\P{scx=Latin})\s+(?=\p{scx=Latin})/$1/g;' input.txt
hello world你好世界！
hello world你好世界！
hello world你好世界！
你好世界hello world

It performs a separate substitution for each case where spaces should be removed, instead of trying to match every possibility with a single regular expression.