I have a file foo.txt that is encoded with charset ISO-8859-1.
I am doing some field extraction with awk, based on a specific position, e.g. for each line, extract a string that starts at position 10 with length 5.
That is a simple task; however, the commands below behave differently on different Linux machines (with different bash/awk versions).
On Machine 1 OK, on Machine 2 NOT OK:
cat foo.dat | iconv -f ISO-8859-1 -t UTF-8 | awk '{print substr($0, 10,5)}' > results.utf8
On Machine 1 NOT OK, on Machine 2 OK:
cat foo.dat | awk '{print substr($0, 10,5)}' | iconv -f ISO-8859-1 -t UTF-8 > results.utf8
If I run the same command with the same input file, the results differ on each line that contains a "non-UTF" character (like a▒c) before the cut position.
I have no idea where the issue is (Linux kernel, bash or awk version...), and especially how to have a common way to extract the desired strings.
CodePudding user response:
"No idea where the issue is, linux Kernel, bash or awk version..."
The GNU Awk User's Guide, section "Bytes vs. Characters", states that:
"The POSIX standard requires that awk function in terms of characters, not bytes. Thus in gawk, length(), substr(), split(), match() and the other string functions (...) all work in terms of characters in the local character set, and not in terms of bytes. (Not all awk implementations do so, though)."
If the above holds true, then the answer to how to have a common way to extract the desired strings is to use an AWK implementation compliant with POSIX (or at least one that respects the above rule to work in terms of characters, not bytes) and to make sure the local character set is as desired.
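As a minimal sketch of the "make sure the local character set is as desired" part: the locale can be pinned for the awk call itself, so the result no longer depends on each machine's default. With LC_ALL=C one byte counts as one character, which matches ISO-8859-1 input that has not been converted yet. The sample line is made up for illustration (the byte \351 is "é" in ISO-8859-1); with the real file, foo.dat would be fed to awk instead of printf.

```shell
# Hypothetical ISO-8859-1 sample line with a non-ASCII byte (\351 = é)
# before the cut position.
result=$(printf 'a\351cdefghijklmnopq\n' \
  | LC_ALL=C awk '{print substr($0, 10, 5)}' \
  | iconv -f ISO-8859-1 -t UTF-8)

# Positions 10..14 of the sample, counted in bytes.
echo "$result"
```

The conversion to UTF-8 is done last on purpose: converting first would turn the é into two bytes, and a byte-oriented awk would then cut one position too early.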
CodePudding user response:
One option is to use a language which only has one implementation and where you can turn off UTF-8 (or rather, fail to turn it on).
It's not entirely clear what you expect the output to be, but I'm guessing you want something like this:
perl -lne 'print substr($_, 9, 5)' foo.dat | iconv -f ISO-8859-1 -t UTF-8
Notice how the conversion only happens after the extraction, so you can be sure that each byte is exactly one character.