I have a very large file like so:
ID Class Values
126 1 332222330442022...
753 1 332222330442022...
119 1 402224220402022...
830 1 002233440232022...
944 1 222222220002022...
The 3rd column is a string of 50,000 characters. I need to ignore the top line, drop the 2nd column, replace all 3s and 4s in the 3rd column with 1s, and finally print the 3rd column with every character separated by a space.
So the desired output is:
126 1 1 2 2 2 2 1 1 0 1 1 2 0 2 2...
753 1 1 2 2 2 2 1 1 0 1 1 2 0 2 2...
119 1 0 2 2 2 1 2 2 0 1 0 2 0 2 2...
830 0 0 2 2 1 1 1 1 0 2 1 2 0 2 2...
944 2 2 2 2 2 2 2 2 0 0 0 2 0 2 2...
Because the file is so large, it would be good to avoid using split() on the 3rd column if possible.
So far, I can achieve everything except printing the 3rd column separated by spaces, using the following:
awk -F " " 'NR!= 1 { gsub(3,1,$3); gsub(4,1,$3); printf "%s\t%s\n", $1, $3 }' ./input.txt
I know I can use split() similar to the answer here (Split tab delimited column with space) but I need to print $1 also. Is it possible to separate the 3rd column in the same awk command?
CodePudding user response:
You may use this awk:
awk -v OFS='\t' 'NR > 1 {
  gsub(/[34]/, 1, $3)    # replace every 3 or 4 in the 3rd column with 1
  gsub(/./, "& ", $3)    # append a space after each character
  sub(/ $/, "", $3)      # trim the trailing space
  print $1, $3
}' file
126 1 1 2 2 2 2 1 1 0 1 1 2 0 2 2
753 1 1 2 2 2 2 1 1 0 1 1 2 0 2 2
119 1 0 2 2 2 1 2 2 0 1 0 2 0 2 2
830 0 0 2 2 1 1 1 1 0 2 1 2 0 2 2
944 2 2 2 2 2 2 2 2 0 0 0 2 0 2 2
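If you'd rather not rewrite $3 twice, a loop over substr() produces the same output by emitting one character per iteration. This is a sketch under the same assumptions about the input layout (whitespace-separated, header on line 1):

```shell
awk 'NR > 1 {
  gsub(/[34]/, 1, $3)                    # map 3s and 4s to 1s
  printf "%s", $1                        # keep the ID column
  for (i = 1; i <= length($3); i++)      # walk the string one character at a time
    printf "%s%s", (i == 1 ? "\t" : " "), substr($3, i, 1)
  print ""                               # terminate the record
}' file
```

Since the loop never appends a separator after the last character, there is no trailing space to clean up afterwards.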