How to replace "00" with N/A, excluding the first row and first column, using bash in a comma-separated file

I'm working with GWAS data, My data looks like this:

IID,kgp11004425,rs11274005,kgp183005,rs746410036,kgp7979600
1,00,AG,GT,AK,00
32,AG,GG,AA,00,AT
100,TT,AA,00,AG,AA       
3,GG,AG,00,GT,GG

Desired Output:

IID,kgp11004425,rs11274005,kgp183005,rs746410036,kgp7979600
1,N/A,AG,GT,AK,N/A
32,AG,GG,AA,N/A,AT
100,TT,AA,N/A,AG,AA       
3,GG,AG,N/A,GT,GG

sed '1!s~00~N/A~g' allSNIPsFinaldata.csv

The above command skips the first row but not the first column, so IID values such as 100, 200, and 300 come out as 1N/A, 2N/A, and 3N/A. Can anyone please help me exclude both the first row and the first column when performing this replacement?

I tried this:

awk 'NR>1{$0=$0","; gsub(/,00,/,",NA,"); sub(/,$/,"")} 1' file

Note: I think the command above treated 00 as an integer, but it is the string "00", so the command does not replace "00" with "NA". Can anyone please help with a command that replaces "00" with "NA"?

CodePudding user response：

You can use

sed -E '1!{:a;s~^([^,]*,.*)00~\1N/A~;ta;}' file > newfile

Details:

-E - enables POSIX ERE syntax
1! - apply the commands in braces to every line except the first
:a - define a label named a
s~^([^,]*,.*)00~\1N/A~ - at the start of the line, capture into Group 1 zero or more chars other than a comma, then a comma, then any text; then match 00, and replace the whole match with the Group 1 contents followed by N/A. Because .* is greedy, each pass replaces the last remaining 00 after the first comma
ta - if the substitution succeeded, jump back to the a label and try again, so every 00 after the first column is eventually replaced.
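To see why the :a / ta loop is needed, here is a small sketch that compares a single pass of the substitution with the looped version. The row used is a made-up example (not from the question's file), and GNU sed is assumed for the `:a;...;ta` one-line syntax.

```shell
#!/bin/bash
# Made-up data row: the ID "100" contains "00", and two later fields are "00".
row='100,00,AG,00'

# Single pass: the greedy .* means only the LAST "00" after the first comma is replaced.
one_pass=$(printf '%s\n' "$row" | sed -E 's~^([^,]*,.*)00~\1N/A~')

# Looped: ta jumps back to label a after each successful substitution,
# so the remaining "00" values are replaced one pass at a time.
looped=$(printf '%s\n' "$row" | sed -E ':a;s~^([^,]*,.*)00~\1N/A~;ta')

printf '%s\n' "$one_pass"   # 100,00,AG,N/A
printf '%s\n' "$looped"     # 100,N/A,AG,N/A
```

In both cases the ID 100 is left alone, because [^,]* must consume the whole first field before the first comma.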

See the online demo:

#!/bin/bash
s='IID,kgp11004425,rs11274005,kgp183005,rs746410036,kgp7979600
1,00,AG,GT,AK,00
32,AG,GG,AA,00,AT
100,TT,AA,00,AG,AA       
3,GG,AG,00,GT,GG'

sed -E '1!{:a;s~^([^,]*,.*)00~\1N/A~;ta;}' <<< "$s"

Output:

IID,kgp11004425,rs11274005,kgp183005,rs746410036,kgp7979600
1,N/A,AG,GT,AK,N/A
32,AG,GG,AA,N/A,AT
100,TT,AA,N/A,AG,AA       
3,GG,AG,N/A,GT,GG
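Note that the pattern above replaces any 00 that appears after the first comma, even inside a longer value. The genotype fields in the question do not seem to contain such values, but if yours might, a possible variant is to match only fields that are exactly 00. The following is a sketch under that assumption, using hypothetical column names and values and GNU sed syntax:

```shell
#!/bin/bash
# Hypothetical rows: "A00" and the ID "100" contain "00" only as a substring.
data=$(printf 'IID,snp1,snp2,snp3\n100,00,00,A00\n3,A00,GG,00\n')

# Match ",00" only when followed by a comma or end of line, so the first
# column (which has no leading comma) and partial matches are never touched.
# The loop handles adjacent fields such as ",00,00".
whole_field=$(printf '%s\n' "$data" | sed -E '1!{:a;s~,00(,|$)~,N/A\1~;ta;}')

printf '%s\n' "$whole_field"
```

Trailing spaces or carriage returns after a final 00 would prevent the end-of-line match, so this variant assumes clean line endings.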

CodePudding user response：

One awk idea:

awk '
BEGIN { FS=OFS="," }
NR>1  { for (i=2;i<=NF;i++)      # skip 1st line; loop through fields 2 to NF
            if ($i == "00")      # if field = "00" then ...
               $i="N/A"          # replace with "N/A"
      }
1                                # print current line
' file

Or as a one-liner:

awk 'BEGIN {FS=OFS=","} NR>1{for (i=2;i<=NF;i++) if ($i == "00") $i="N/A"}1' file

This generates:

IID,kgp11004425,rs11274005,kgp183005,rs746410036,kgp7979600
1,N/A,AG,GT,AK,N/A
32,AG,GG,AA,N/A,AT
100,TT,AA,N/A,AG,AA
3,GG,AG,N/A,GT,GG
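One caveat: the comparison $i == "00" is exact, so a last field of 00 followed by trailing spaces (one sample row in the question appears to have trailing spaces) or a Windows carriage return would not match. A possible workaround, sketched here with made-up data and column names, is to trim a copy of each field before comparing:

```shell
#!/bin/bash
# Made-up input: row 2 ends with "00" plus spaces, row 3 ends with "00" plus a CR.
data=$(printf 'IID,snp1,snp2\n100,00,00   \n3,AG,00\r\n')

result=$(printf '%s\n' "$data" | awk '
BEGIN { FS=OFS="," }
NR>1 {
    for (i=2; i<=NF; i++) {
        v = $i                                  # work on a copy of the field
        gsub(/^[ \t\r]+|[ \t\r]+$/, "", v)      # strip surrounding blanks and CRs
        if (v == "00") $i = "N/A"               # replace only exact "00" fields
    }
}
1')

printf '%s\n' "$result"
```

Fields that are not replaced keep their original whitespace; only the fields rewritten to N/A lose it.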