create new column containing a substring of an existing column in data using bash-CodePudding

I have a large tsv.gz file (40GB) for which I want to extract a string from an existing variable col3, store it in a new variable New_var (placed at the beginning) and save everything the new file. an example of the data "old_file.tsv.gz"

col1  col2  col3  col4
1  positive  12:1234A  100
2  negative  10:9638B  110
3  positive  5:0987A  100
4  positive  8:5678A  170

Desired data "new_file.tsv.gz"

New_var  col1  col2  col3  col4
12  1  positive  12:1234A  100
10  2  negative  10:9638B  110
5  3  positive  5:0987A  100
8  4  positive  8:5678A  170

I am new in bash so I have tried multiple things but I get stuck, I have tried

zcat old_file.tsv.gz | awk '{print New_var=$3,$0 }' | awk '$1 ~ /^[0-9]:/{print $0 | (gzip -c > new_file.tsv.gz) }'

I think I have multiple problems. {print New_var=$3,$0 } do create a duplicate of col3 but doesn't rename it. Then when I add the last part of the code awk '$1 ~ /^[0-9]:/{print $0 | (gzip -c > new_file.tsv.gz) }'...well nothing comes up (I tried to look if I forget a parenthesis but cannot find the problem). Also I am not sure if this way is the best way to do it. Any idea how to make it work?

CodePudding user response：

Make an AWK script in a separate file (for readability), say 1.awk:

{ if (NR > 1) { 
    # all data lines 
    split($3, a, ":");  
    print a[1], $1, $3, $3, $4; 
  } else {
    # header line
    print "new_var", $1, $2, $3, $4;
  } 
}

Now process the input (say 1.csv.gz) with the AWK file:

zcat 1.csv.gz | awk -f 1.awk | gzip -c > 1_new.csv.gz

CodePudding user response：

I suggest to use one tab (\t) and : as input field separator:

awk 'BEGIN { FS="[\t:]"; OFS="\t" }
     NR==1 { $1="New_var" OFS $1 }
     NR>1  { $0=$3 OFS $0 }
     { print }'

As one line:

awk 'BEGIN{ FS="[\t:]"; OFS="\t" } NR==1{ $1="New_var" OFS $1 } NR>1{ $0=$3 OFS $0 } { print }'

See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR