How to replace 00 with Na excluding first row & first column using bash in comma separated file-CodePudding

I'm working with GWAS data, My data looks like this:

IID,kgp11004425,rs11274005,kgp183005,rs746410036,kgp7979600
1,00,AG,GT,AK,00
32,AG,GG,AA,00,AT
100,TT,AA,00,AG,AA       
3,GG,AG,00,GT,GG

Desired Output:

IID,kgp11004425,rs11274005,kgp183005,rs746410036,kgp7979600
1,N/A,AG,GT,AK,N/A
32,AG,GG,AA,N/A,AT
100,TT,AA,N/A,AG,AA       
3,GG,AG,N/A,GT,GG

Here I'm trying to replace "00" with "N/A", but since I have 00 in the first row & First Column (IID), the command I used:

sed '1!s~00~N/A~g' allSNIPsFinaldata.csv

The above command excludes the first row but not the first column as a result I got IID Values 100, 200, and 300 as 1N/A, 2N/A, and 3N/A. Can anyone please help "how to exclude the first row & First Column as well and perform the above operation. please help

CodePudding user response：

With your shown samples in GNU awk using its gensub function, please try following awk program.

awk '
BEGIN{
  FS=OFS=","
}
FNR==1{
  print
  next
}
{
  secondPart=gensub(/^[^,]*,(.*)/,"\\1","g")
  sub(/^00,/,"N/A,",secondPart)
  gsub(/,00,/,",N/A,",secondPart)
  sub(/,00$/,",N/A",secondPart)
  print $1 OFS secondPart
}
'  Input_file

CodePudding user response：

Assuming the columns are separated by space characters such as whitespace or tab character, would you please try:

sed -E '1!s~([[:space:]])00([[:space:]]|$)~\1N/A\2~g' allSNIPsFinaldata.csv

The address 1! skips the 1st row.
The regex ([[:space:]])00([[:space:]]|$) matches the 00 string preceded by a space character (it prevents to match the 1st column) and followed by a space character or the end of the line.

CodePudding user response：

An awk:

$ awk '
BEGIN {
    FS=OFS=","          # set field delimiters to a comma
}
FNR>1 {                 # process records after the first
    for(i=1;i<=NF;i  )  # iterate all fields (maybe start from 2nd?)
        if($i=="00")    # if field is 00
            $i="N/A"    # replace
}1' file                # output

Output:

IID,kgp11004425,rs11274005,kgp183005,rs746410036,kgp7979600
1,N/A,AG,GT,AK,N/A
32,AG,GG,AA,N/A,AT
100,TT,AA,N/A,AG,AA       
3,GG,AG,N/A,GT,GG

CodePudding user response：

If you want to repolace only 00 in other columns you have to add a delimiter (I am assuming space in my command) in your pattern: sed -i 's~ 00 ~ N/A ~g' allSNIPsFinaldata.csv

CodePudding user response：

This might work for you (GNU sed):

sed -E '1!{s/,00(,|$)/,N\/A\1/g;s//,N\/A\1/g}' file

If not the first line and , followed by 00 followed by , or end-of-line, replace the 00 by N/A and other parts of the match remain unchanged.

This substitution is global but needs to be implemented twice because the patterns may overlap.