How to substitute columns, rearrange, include and exclude data in a .tsv file within one awk command-CodePudding

I have a .tsv file where I want to 1/ substitute the values of the 6th column onwards until the last column, 2/ exclude the 4th and 5th columns, 3/ rearrange, i.e. swap the 3rd column for the 2nd column and lastly include a column between the 2nd and 3rd columns where the value is the number 0 for all observation (lines).

1/ the values from the 6th column onwards can either be 0,1 or 2. What I want to do is: If the number is 0, substitute by 1\t1 (so it will create another column). If the number is 1, substitute by 1\t2 and if it is 2, substitute by 2/t2.

I've managed to do this by generating a new .tsv with only the values of the 6th column onwards, an then using the commands in awk below:

awk '{gsub(/2/,"2\t2");}1' file.txt > file1; awk '{gsub(/1/,"1\t2");}1' file1 > file2; awk '{gsub(/0/,"1\t1");}1' file2 > file3

After that I would work on the 2/ and 3/ problems. I would have a .tsv file generate with awk from the original .tsv file already excluding the 4th and 5th columns and swapping the 3rd and 2nd columns with this command:

awk -v OFS="\t" '{ print $1, $3, $2 }' original_tsv_file.tsv > reordered_tsv_file.tsv

Then, the next step is to include the number 0 as the 3rd column of the new generated reordered_tsv_file.tsv. I used the command below:

sed -i -e 's/^/0\t/' reordered_tsv_file.tsv

This would include the number 0 in front of every new line and now I would just do the awk command from the step before to reorder the columns like this:

awk -v OFS="\t" '{ print $2, $3, $1, $4 }' reordered_tsv_file.tsv > final_columns_to_be_merged_with_file3.tsv

And finally in order to get my desired file I would just paste the final_columns_to_be_merged_with_file3.tsv and the file3 with the command:

paste -d'\t' final_columns_to_be_merged_with_file3.tsv file3 > final_file.tsv

Below is and example of the original_tsv_file.tsv:

chr15   101152646   chr15:101152646:A:G A   G   1   1   0   0
chr15   101152650   chr15:101152650:A:C A   C   1   1   0   0
chr15   101152872   chr15:101152872:G:A G   A   1   1   0   0
chr15   101152923   chr15:101152923:G:A G   A   1   1   0   0
chr15   101152954   chr15:101152954:C:T C   T   0   2   0   0
chr15   101153197   chr15:101153197:G:C G   C   0   2   0   0

And the final_file.tsv:

chr15   chr15:101152646:A:G 0   101152646   1   2   1   2   1   1   1   1
chr15   chr15:101152650:A:C 0   101152650   1   2   1   2   1   1   1   1
chr15   chr15:101152872:G:A 0   101152872   1   2   1   2   1   1   1   1
chr15   chr15:101152923:G:A 0   101152923   1   2   1   2   1   1   1   1
chr15   chr15:101152954:C:T 0   101152954   1   1   2   2   1   1   1   1
chr15   chr15:101153197:G:C 0   101153197   1   1   2   2   1   1   1   1

What I want is a solution or advice on how can I make this process in a one piece of coding instead of using 5 different commands and tons of generated files in order to get the final result. All the commands above I managed to get it and modifying from the internet, but the I don't; have the knowledge yet to improve this.

Thanks for any help.

CodePudding user response：

you can try it

awk 'BEGIN{OFS="\t"}
     {printf "%s%s%s%s0%s%s", $1, OFS, $3, OFS, OFS, $2; 
      for(i=6;i<=NF;  i){
          if($i==0) printf OFS"1"OFS"1"; 
          else if($i==1) printf OFS"1"OFS"2"; 
          else printf OFS"2"OFS"2"
      } printf ORS}
' original_tsv_file.tsv > final_file.tsv

you get in final_file.tsv

chr15   chr15:101152646:A:G 0   101152646   1   2   1   2   1   1   1   1
chr15   chr15:101152650:A:C 0   101152650   1   2   1   2   1   1   1   1
chr15   chr15:101152872:G:A 0   101152872   1   2   1   2   1   1   1   1
chr15   chr15:101152923:G:A 0   101152923   1   2   1   2   1   1   1   1
chr15   chr15:101152954:C:T 0   101152954   1   1   2   2   1   1   1   1
chr15   chr15:101153197:G:C 0   101153197   1   1   2   2   1   1   1   1

CodePudding user response：

Assumptions:

OP wants all output columns delimited by a tab (\t) (expected output appears to be fixed width, but OP's code mentions OFS="\t")
columns 6 through EOL contain only one of 3 values: 0 or 1 or 2

One awk idea:

awk '
BEGIN { OFS="\t" }
      { outline= $1 OFS $3 OFS "0" OFS $2
        for (i=6;i<=NF;i  )
                 if ($i==0) outline=outline OFS 1 OFS 1
            else if ($i==1) outline=outline OFS 1 OFS 2
            else            outline=outline OFS 2 OFS 2
        print outline
      }
' file.txt

A slightly convoluted idea using a couple ternary operations:

awk '
BEGIN { OFS="\t" }
      { outline= $1 OFS $3 OFS "0" OFS $2
        for (i=6;i<=NF;i  )
            outline=outline OFS \
                    (($i==0) ? 1 OFS 1 : (($i==1) ? 1 OFS 2 : 2 OFS 2))
        print outline
      }
' file.txt

Both of these generate:

chr15   chr15:101152646:A:G     0       101152646       1       2       1       2       1       1       1    1
chr15   chr15:101152650:A:C     0       101152650       1       2       1       2       1       1       1    1
chr15   chr15:101152872:G:A     0       101152872       1       2       1       2       1       1       1    1
chr15   chr15:101152923:G:A     0       101152923       1       2       1       2       1       1       1    1
chr15   chr15:101152954:C:T     0       101152954       1       1       2       2       1       1       1    1
chr15   chr15:101153197:G:C     0       101153197       1       1       2       2       1       1       1    1