bash and awk extract string at specific position in non-utf file

Time:04-30

I have a file foo.dat that is encoded in the ISO-8859-1 charset. I am doing some field extraction with awk, based on fixed positions, e.g. on each line, extract the string that starts at position 10 and has length 5.

That is a simple task, but the commands below behave differently on different Linux machines (with different bash/awk versions).

On Machine 1 OK, on Machine 2 not OK:

cat foo.dat | iconv -f ISO-8859-1 -t UTF-8 | awk '{print substr($0, 10,5)}' > results.utf8

On Machine 1 not OK, on Machine 2 OK:

cat foo.dat | awk '{print substr($0, 10,5)}' | iconv -f ISO-8859-1 -t UTF-8 > results.utf8

If I run the same command with the same input file on both machines, the results differ on every line that contains a "non-utf" char (like a▒c) before the cut position.

I have no idea where the issue is (Linux kernel, bash or awk version...), and especially how to have a common way to extract the desired strings.

CodePudding user response:

"No idea where the issue is, linux Kernel, bash or awk version..."

The GNU Awk User's Guide, section "Bytes vs. Characters", states that

The POSIX standard requires that awk function in terms of characters, not bytes. Thus in gawk, length(), substr(), split(), match() and the other string functions (...) all work in terms of characters in the local character set, and not in terms of bytes. (Not all awk implementations do so, though).

If the above holds, then the answer to "how to have a common way to extract the desired strings" is to use an AWK implementation that complies with POSIX (or at least one that respects the rule above and works in terms of characters, not bytes) and to make sure the locale's character set is the one you want.
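As a minimal sketch of the "make sure the locale is as desired" part: one possible way to get the same result from byte-oriented and character-aware awk builds is to force the C locale for the awk step only, so that each ISO-8859-1 byte counts as one character, and to convert afterwards. The sample data written to foo.dat here is hypothetical (\351 is the ISO-8859-1 byte for é), and the claim that awk counts bytes under LC_ALL=C is an assumption that, as far as I know, holds for gawk and mawk; the quoted manual section does not guarantee it for every implementation.

```shell
# Hypothetical sample input in ISO-8859-1: \351 is the single byte for "é".
# Line 1 has one such byte before the cut position and one inside the cut.
printf 'ab\351defghiAB\351DExyz\n0123456789abcdef\n' > foo.dat

# LC_ALL=C is set for awk only, so substr() counts bytes here;
# iconv then converts the already-extracted field to UTF-8.
LC_ALL=C awk '{ print substr($0, 10, 5) }' foo.dat |
  iconv -f ISO-8859-1 -t UTF-8 > results.utf8
```

The other direction (convert to UTF-8 first, then run a character-aware awk under a UTF-8 locale) should also work, but it depends on which awk and which locales are installed on each machine.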

CodePudding user response:

One option is to use a language which only has one implementation and where you can turn off UTF-8 (or rather, fail to turn it on).

It's not entirely clear what you expect the output to be, but I'm guessing you want something like this:

perl -lne 'print substr($_, 9, 5)' foo.dat | iconv -f ISO-8859-1 -t UTF-8 

Notice how the conversion only happens after the extraction, so you can be sure that each byte is exactly one character.
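The extract-first, convert-afterwards order is not specific to Perl. As a hedged sketch that is not part of the answer above: POSIX cut with -b selects by byte, so it should not depend on the locale or on which awk is installed. The sample foo.dat below is hypothetical (\351 is the ISO-8859-1 byte for é), and bytes 10-14 correspond to position 10 with length 5.

```shell
# Hypothetical sample input in ISO-8859-1: \351 is the single byte for "é".
printf 'ab\351defghiAB\351DExyz\n0123456789abcdef\n' > foo.dat

# Select bytes 10 through 14 of every line, then convert the result to UTF-8.
cut -b 10-14 foo.dat | iconv -f ISO-8859-1 -t UTF-8 > results.utf8
```

This only makes sense while the input is still in a single-byte encoding such as ISO-8859-1; on UTF-8 input, cutting by byte could split a multi-byte character.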
