AWK - compare two 1-column files for matching strings then write a file with updated info

Time:03-27

I have a problem comparing two files with different numbers of lines and using information from one file to update the other. I have tried various key-based examples that I found online, but none seems to work.

Hopefully you could help me with this one.

I have two files:

$ cat 1.txt
>01234
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>13920
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
...
...
(>1000 lines)


$ cat 2.txt
hbcd/efgh/z-01234/2000
hbcd/efgh/zw-11000/2000
hbcd/efgh/t-13290/2000
...
...
(<1000 lines)

My intention is to produce an updated version of 1.txt in which the matching header lines are replaced, while the lines that have no match in 2.txt are kept. The result could be saved as a new file, as follows:

$ cat 3.txt
>hbcd/efgh/z-01234/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>hbcd/efgh/t-13290/2000
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
...
...

I have tried something like this: awk 'NR==FNR{a[$0]=$0;next}{a[$1]=$0}END{for (i in a) print a[i]}' 1.txt 2.txt > 3.txt

or like this (to search based on the substring in 1.txt): awk 'NR==FNR{a[substr($0,2,5)]=$0;next}{a[$1]=$0}END{for (i in a) print a[i]}' 2.txt 1.txt > 3.txt

but then I get mixed-up lines in the output file, something like this (or, with the second command, an output without the lines from 2.txt):

01234 17321 SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA abcd/efgh/t-13290/2000 NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN

I haven't used awk for a very long time and I'm not sure how the arrays and keys work.

Update: I have tried to write an awk script, to do the above. The condition to check them works but somehow I still have a problem with writing the lines from 1.txt that don't match the ones from 2.txt.

BEGIN{
    i = 0;
    j = 0;
    k = 0;
    maxi = 0;
    maxj = 0;
    maxk = 0;
    FS = "\/";
}

FILENAME == ARGV[1]{
    header1=substr($0,1,1);
    if(header1==">"){
        ++maxi;
        seqcode1[maxi]=substr($0,2,5);
#       printf("%s\n",seqcode1[maxi]);
    }
    else if(header1!=">"){
        ++maxk;
        seqFASTA[maxk]=$0;
#       print seqFASTA[maxk];
    }
}

FILENAME == ARGV[2]{
    header2=substr($0,1,1);
    if(header2=="h"){
        ++maxj;
        wholename[maxj]=$0;
        seqcode2[maxj]=substr($3,4,5);
#       printf("%s\n",seqcode2[maxj]);
    }
}

END{
    for(i=1;i<=maxi=maxk;i++){
      for(j=1;j<=maxj;j++){
        if(seqcode1[i] == seqcode2[j]) {
            printf("%s %s %s\n",seqcode1[i],seqcode2[j],wholename[j]);
        }
        else
          print seqcode1[i];
          print seqFASTA[k];
        }
    }
}

I think the problem may be with declaring seqFASTA but I'm not sure where.

Thank you very much! M.

CodePudding user response:

I'm assuming 13920 should be 13290 in 1.txt.

$ awk 'NR==FNR{split($0, a, "/"); sub(/^[^-]+-/, "", a[3]); map[a[3]]=$0; next}
       (k=substr($0, 2)) in map{$0 = ">" map[k]} 1' 2.txt 1.txt
>hbcd/efgh/z-01234/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>hbcd/efgh/t-13290/2000
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
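
For readers less familiar with awk arrays, here is a minimal sketch of the same idea written out with comments. It recreates the sample files in a temporary directory (with 13920 already changed to 13290, as assumed above); the names tmp, result, parts, id and key are only illustrative and not part of the original command.

```shell
#!/bin/sh
# Sketch: expanded, commented form of the one-liner, run on the sample data.
# The temporary directory and all variable names here are illustrative.
tmp=$(mktemp -d)

# 1.txt, with 13920 already corrected to 13290 (assumption stated in the answer)
cat > "$tmp/1.txt" <<'EOF'
>01234
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>13290
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
EOF

cat > "$tmp/2.txt" <<'EOF'
hbcd/efgh/z-01234/2000
hbcd/efgh/zw-11000/2000
hbcd/efgh/t-13290/2000
EOF

result=$(awk '
# First file (2.txt): NR==FNR holds only while the first file is being read
NR == FNR {
    split($0, parts, "/")      # parts[3] is e.g. "z-01234"
    id = parts[3]
    sub(/^[^-]+-/, "", id)     # drop everything up to and including the first "-"
    map[id] = $0               # key: the id, value: the full name
    next                       # skip the remaining rules for this file
}
# Second file (1.txt): header lines start with ">"
/^>/ {
    key = substr($0, 2)        # text after ">"
    if (key in map)
        $0 = ">" map[key]      # replace the header with the full name
}
{ print }                      # print every line of 1.txt, changed or not
' "$tmp/2.txt" "$tmp/1.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because each line of 1.txt is printed as it is read, the output keeps the original order. A `for (i in a)` loop in an END block, as in the attempts in the question, visits the keys in an unspecified order, which is one reason the lines came out mixed up.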

Here are some alternate solutions:

# with GNU awk
awk 'NR==FNR{match($0, /-([0-9]+)/, a); map[a[1]]=$0; next}
     (k=substr($0, 2)) in map{$0 = ">" map[k]} 1' 2.txt 1.txt

# assuming '/' and '-' will always be similar to given sample
awk -F'[/-]' 'NR==FNR{map[$4]=">"$0; next}
              $2 in map{$0 = map[$2]} 1' 2.txt FS='>' 1.txt
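
To see why the last variant uses `$4` for 2.txt and `$2` for 1.txt, here is a small sketch that prints the relevant field for one line of each file. The third input line is a made-up name, not from the sample, and only illustrates the stated assumption about where '/' and '-' appear.

```shell
# Sketch: how the two field separators split each kind of line.
# With FS='[/-]', "hbcd/efgh/z-01234/2000" splits into hbcd, efgh, z, 01234, 2000.
name_id=$(printf '%s\n' 'hbcd/efgh/z-01234/2000' | awk -F'[/-]' '{print $4}')

# With FS='>', ">01234" splits into an empty first field and "01234".
header_id=$(printf '%s\n' '>01234' | awk -F'>' '{print $2}')
printf '%s %s\n' "$name_id" "$header_id"

# Hypothetical name with an extra "-" before the id: the fields shift,
# so $4 no longer holds the id.
shifted=$(printf '%s\n' 'hb-cd/efgh/z-01234/2000' | awk -F'[/-]' '{print $4}')
printf '%s\n' "$shifted"
```

The `FS='>'` placed between the two file names is an awk command-line variable assignment: it takes effect after 2.txt has been read and before 1.txt is opened, so each file is split with its own separator.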