I have a large tsv.gz file (40GB) for which I want to extract a string from an existing variable col3
, store it in a new variable New_var
(placed at the beginning) and save everything the new file.
an example of the data "old_file.tsv.gz"
col1 col2 col3 col4
1 positive 12:1234A 100
2 negative 10:9638B 110
3 positive 5:0987A 100
4 positive 8:5678A 170
Desired data "new_file.tsv.gz"
New_var col1 col2 col3 col4
12 1 positive 12:1234A 100
10 2 negative 10:9638B 110
5 3 positive 5:0987A 100
8 4 positive 8:5678A 170
I am new in bash so I have tried multiple things but I get stuck, I have tried
zcat old_file.tsv.gz | awk '{print New_var=$3,$0 }' | awk '$1 ~ /^[0-9]:/{print $0 | (gzip -c > new_file.tsv.gz) }'
I think I have multiple problems. {print New_var=$3,$0 }
do create a duplicate of col3
but doesn't rename it. Then when I add the last part of the code awk '$1 ~ /^[0-9]:/{print $0 | (gzip -c > new_file.tsv.gz) }'
...well nothing comes up (I tried to look if I forget a parenthesis but cannot find the problem).
Also I am not sure if this way is the best way to do it.
Any idea how to make it work?
CodePudding user response:
Make an AWK script in a separate file (for readability), say 1.awk
:
{ if (NR > 1) {
# all data lines
split($3, a, ":");
print a[1], $1, $3, $3, $4;
} else {
# header line
print "new_var", $1, $2, $3, $4;
}
}
Now process the input (say 1.csv.gz
) with the AWK file:
zcat 1.csv.gz | awk -f 1.awk | gzip -c > 1_new.csv.gz
CodePudding user response:
I suggest to use one tab (\t
) and :
as input field separator:
awk 'BEGIN { FS="[\t:]"; OFS="\t" }
NR==1 { $1="New_var" OFS $1 }
NR>1 { $0=$3 OFS $0 }
{ print }'
As one line:
awk 'BEGIN{ FS="[\t:]"; OFS="\t" } NR==1{ $1="New_var" OFS $1 } NR>1{ $0=$3 OFS $0 } { print }'
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR