I have a file with non-ascii characters.
$ org od -t c -t x1 -A d tmp.txt
0000000 S o - c a l l e d 217 204 l a b
53 6f 2d 63 61 6c 6c 65 64 f4 8f b1 84 6c 61 62
0000016 e l e d 217 204 p a t t e r n s
65 6c 65 64 f4 8f b1 84 70 61 74 74 65 72 6e 73
0000032 217 204 c a n b e 217 204 u s
f4 8f b1 84 63 61 6e 20 62 65 f4 8f b1 84 75 73
0000048 e d 217 204 w i t h 217 204 s i
65 64 f4 8f b1 84 77 69 74 68 f4 8f b1 84 73 69
0000064 n g l e , 217 204 d o u b l e
6e 67 6c 65 2c 20 f4 8f b1 84 64 6f 75 62 6c 65
0000080 , 217 204 a n d 217 204 t r i
2c 20 f4 8f b1 84 61 6e 64 f4 8f b1 84 74 72 69
0000096 p l e 217 204 b l a n k s .
70 6c 65 f4 8f b1 84 62 6c 61 6e 6b 73 2e
As you can see, \x{f4}\x{8f}\x{b1}\x{84} has several occurrences. I want to replace \x{f4}\x{8f}\x{b1}\x{84} with whitespace. According to this, I tried:
s/\x{f4}\x{8f}\x{b1}\x{84}/ /g;
tr/\x{f4}\x{8f}\x{b1}\x{84}/ /;
It doesn't work. But if I remove these two lines from the script:
use utf8;
use open qw( :std :encoding(UTF-8) );
It works. Why?
I suspect that it is because Perl only deals with characters, but \x{f4}\x{8f}\x{b1}\x{84} is not regarded as a character. Is there a way to remove \x{f4}\x{8f}\x{b1}\x{84}, or any other binary content or non-UTF-8 characters, with Perl?
CodePudding user response:
While the file may contain "\x{f4}\x{8f}\x{b1}\x{84}", your string contains "\x{10FC44}" ("\N{U+10FC44}" if you prefer) because you decoded what you read. As such, you'd need
tr/\N{U+10FC44}/ /
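To make the bytes-versus-characters difference concrete, here is a minimal sketch. It assumes the core Encode module is available, and the sample strings are stand-ins for a fragment of the file, not the real file contents.

```perl
use strict;
use warnings;
use Encode qw( decode );

# The four raw bytes as they appear in the od dump.
my $bytes = "\xF4\x8F\xB1\x84";

# Decoding UTF-8 collapses the four bytes into a single character.
my $chars = decode('UTF-8', $bytes);
printf "%d bytes -> %d character, U+%04X\n",
    length($bytes), length($chars), ord($chars);

# Sample data (an assumption): a fragment of the file, raw and decoded.
my $raw_text     = "So-called\xF4\x8F\xB1\x84labeled";
my $decoded_text = decode('UTF-8', $raw_text);

# The byte-wise pattern matches the raw bytes but not the decoded string.
my $byte_re         = qr/\x{f4}\x{8f}\x{b1}\x{84}/;
my $matches_raw     = $raw_text     =~ $byte_re ? 1 : 0;
my $matches_decoded = $decoded_text =~ $byte_re ? 1 : 0;
print "byte pattern matches raw: $matches_raw, decoded: $matches_decoded\n";

# On the decoded string, target the code point instead.
(my $fixed = $decoded_text) =~ tr/\N{U+10FC44}/ /;
print "$fixed\n";
```

The first line printed should be "4 bytes -> 1 character, U+10FC44". That is why the byte-wise pattern only works once the decoding layer is removed.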
It's a private-use Code Point. To replace all 137,468 private-use Code Points, you can use
s/\p{General_Category=Private_Use}/ /g
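As a rough sketch of how this substitution fits into a script that decodes its input, the following reads from an in-memory filehandle through a :encoding(UTF-8) layer. The sample bytes and variable names are assumptions for illustration; a real script would read tmp.txt instead.

```perl
use strict;
use warnings;

# Raw bytes standing in for part of the file (sample data, not the real file).
my $raw = "can be\xF4\x8F\xB1\x84used\xF4\x8F\xB1\x84with";

# Read through a decoding layer, similar to what
# `use open qw( :std :encoding(UTF-8) )` sets up for real files.
open(my $in, '<:encoding(UTF-8)', \$raw)
    or die "Cannot open in-memory file: $!";
my $line = <$in>;
close($in);

# Replace every private-use character with a space.
(my $clean = $line) =~ s/\p{General_Category=Private_Use}/ /g;
print "$clean\n";
```

Because the property matches any private-use character, this also covers private-use code points other than U+10FC44.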
General_Category can be abbreviated to Gc.
Private_Use can be abbreviated to Co.
General_Category= can be omitted.
So these are equivalent:
s/\p{Gc=Private_Use}/ /g
s/\p{Private_Use}/ /g
s/\p{Co}/ /g
Co makes me think of "control", so maybe it's best to avoid that one. (Control characters are identified by the Control, aka Cc, general category.)
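A small check of the spellings above, as a sketch: it tests each property against the private-use character from the question and against BEL (U+0007), which is picked here only as an example of a control character.

```perl
use strict;
use warnings;

my $private_use = "\x{10FC44}";   # the character from the question
my $control     = "\x{07}";       # BEL, an example control character

# Property spellings to compare; Cc is included to contrast with Co.
my @spellings = (
    [ 'Gc=Private_Use', qr/\p{Gc=Private_Use}/ ],
    [ 'Private_Use',    qr/\p{Private_Use}/    ],
    [ 'Co',             qr/\p{Co}/             ],
    [ 'Cc',             qr/\p{Cc}/             ],
);

my %result;
for my $entry (@spellings) {
    my ($name, $re) = @$entry;
    # Record whether the property matches each test character.
    $result{$name} = [
        ($private_use =~ $re ? 1 : 0),
        ($control     =~ $re ? 1 : 0),
    ];
    printf "%-15s private-use: %d  control: %d\n", $name, @{ $result{$name} };
}
```

The three private-use spellings should match only the first character, and Cc only the second.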