Is there a way to selectively remove spaces in a string, in bash? e.g.
hello world你好 世界!
hello world你 好 世 界!
hello world 你 好 世 界!
你 好 世 界 hello world
and output:
hello world你好世界!
你好世界hello world
Note that I want to preserve spaces between English words (or simply between letters of the English alphabet), but remove all the others.
I understand that Python's re module is probably good for this, but I prefer a bash command if possible.
CodePudding user response:
You can use sed:
echo hello world你好 世界! | sed -E "s/([^a-zA-Z]) ([^a-zA-Z])/\1\2/g"
([^a-zA-Z]) ([^a-zA-Z])
is a regular expression matching a space between two non-Latin characters (the ^ inside the brackets negates the class). The preceding and following characters are captured in groups 1 and 2.
\1\2
is the replacement string: it writes the two groups back without the space in between.
Output:
hello world你好世界!
Note: to also remove leading and trailing spaces, the expression should be:
(^|[^a-zA-Z]) ([^a-zA-Z]|$)
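As a rough illustration, the anchored expression can be dropped into the same sed call. The function name trim_edges is made up for this sketch and is not part of the answer. It only targets leading spaces, trailing spaces, and spaces between two non-Latin characters.

```shell
# Hypothetical helper: reads lines on stdin, writes the result to stdout.
trim_edges() {
  # Group 1 is either the start of the line or a non-Latin character,
  # group 2 is either a non-Latin character or the end of the line.
  sed -E 's/(^|[^a-zA-Z]) ([^a-zA-Z]|$)/\1\2/g'
}

printf '%s\n' ' 你好 世界 ' | trim_edges   # prints: 你好世界
```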
Edit: One thing I didn't take into account is that this kind of expression consumes the characters before and after the space, so a character cannot close one match and open the next. With the input 你 好 世 界 hello world
some spaces therefore remained. You then have to use a regex engine that supports lookarounds:
echo " 你 好 世 界 hello world, !" | perl -pe "s/(?<=^|[^[:ascii:]]) | (?=[^[:ascii:]]|$)//g"
Output:
你好世界hello world
To remove the spaces between Latin characters and kanji as well, I split the expression into two alternatives. I also replaced the condition on Latin characters with one on ASCII characters, which should give more appropriate matches.
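If perl is not an option, the consumption problem can probably be sidestepped in plain sed by capturing only one neighbour per substitution, so no character has to take part in two matches. This is a minimal sketch and not the answer's method. The function name strip_spaces is hypothetical. Unlike the ASCII-based perl expression, it also drops spaces next to ASCII punctuation and digits.

```shell
# Hypothetical helper: reads lines on stdin, writes cleaned lines to stdout.
strip_spaces() {
  # 1. Trim leading spaces.
  # 2. Trim trailing spaces.
  # 3. Drop spaces that follow a character that is neither a Latin letter nor a space.
  # 4. Drop spaces that precede such a character.
  sed -E 's/^ +//; s/ +$//; s/([^a-zA-Z ]) +/\1/g; s/ +([^a-zA-Z ])/\1/g'
}

printf '%s\n' '你 好 世 界 hello world' | strip_spaces   # prints: 你好世界hello world
```

Because each substitution consumes only the character on one side of the space, consecutive matches such as 你 好 世 do not compete for the same character.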
CodePudding user response:
A perl solution using Unicode properties (in particular, whether or not a character is in the Latin script):
$ perl -CSD -lpe 's/^\s+//; # Remove leading spaces
s/\s+$//; # Remove trailing spaces
# Remove spaces between two non-latin characters.
s/(\P{scx=Latin})\s+(?=\P{scx=Latin})/$1/g;
# Remove spaces between a leading latin and trailing non-latin
s/(\p{scx=Latin})\s+(?=\P{scx=Latin})/$1/g;
# Remove spaces between a leading non-latin and trailing latin
s/(\P{scx=Latin})\s+(?=\p{scx=Latin})/$1/g;' input.txt
hello world你好世界!
hello world你好世界!
hello world你好世界!
你好世界hello world
It does a bunch of substitutions for the different cases where you want to remove spaces instead of trying to use a single regular expression to match every possibility.