I have a problem comparing two files with different numbers of lines and using information from one file to update the other. I have tried various key-based examples that I found online, but none seems to work.
Hopefully you could help me with this one.
I have two files:
$ cat 1.txt
>01234
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>13920
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
...
...
(>1000 lines)
$ cat 2.txt
hbcd/efgh/z-01234/2000
hbcd/efgh/zw-11000/2000
hbcd/efgh/t-13290/2000
...
...
(<1000 lines)
My intention is to have an updated version of 1.txt, with the matched header lines updated but the lines that have no match in 2.txt kept, saved as a new file as follows:
$ cat 3.txt
>abcd/efgh/z-01234/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>abcd/efgh/t-13290/2000
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
...
...
I have tried something like this:
$ awk 'NR==FNR{a[$0]=$0;next}{a[$1]=$0}END{for (i in a) print a[i]}' 1.txt 2.txt > 3.txt
or like this (to search based on the substring in 1.txt):
$ awk 'NR==FNR{a[substr($0,2,5)]=$0;next}{a[$1]=$0}END{for (i in a) print a[i]}' 2.txt 1.txt > 3.txt
but then I get mixed-up lines in the output file, something like this (or, with the second command, output without the lines from 2.txt at all):
01234 17321 SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA abcd/efgh/t-13290/2000 NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
I haven't used awk for a very long time and I'm not sure how the arrays and keys work.
Update: I have tried to write an awk script to do the above. The condition that checks for matches works, but I still have a problem writing out the lines from 1.txt that don't match any line in 2.txt.
BEGIN{
    i = 0;
    j = 0;
    k = 0;
    maxi = 0;
    maxj = 0;
    maxk = 0;
    FS = "\/";
}
FILENAME == ARGV[1]{
    header1=substr($0,1,1);
    if(header1==">"){
        ++maxi;
        seqcode1[maxi]=substr($0,2,5);
        # printf("%s\n",seqcode1[maxi]);
    }
    else if(header1!=">"){
        ++maxk;
        seqFASTA[maxk]=$0;
        # print seqFASTA[maxk];
    }
}
FILENAME == ARGV[2]{
    header2=substr($0,1,1);
    if(header2=="h"){
        ++maxj;
        wholename[maxj]=$0;
        seqcode2[maxj]=substr($3,4,5);
        # printf("%s\n",seqcode2[maxj]);
    }
}
END{
    for(i=1;i<=maxi=maxk;i++){
        for(j=1;j<=maxj;j++){
            if(seqcode1[i] == seqcode2[j]) {
                printf("%s %s %s\n",seqcode1[i],seqcode2[j],wholename[j]);
            }
            else
                print seqcode1[i];
            print seqFASTA[k];
        }
    }
}
I think the problem may be with declaring seqFASTA but I'm not sure where.
Thank you very much! M.
Answer:
I'm assuming 13920 should be 13290 in 1.txt.
$ awk 'NR==FNR{split($0, a, "/"); sub(/^[^-]+-/, "", a[3]); map[a[3]]=$0; next}
(k=substr($0, 2)) in map{$0 = ">" map[k]} 1' 2.txt 1.txt
>hbcd/efgh/z-01234/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>hbcd/efgh/t-13290/2000
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
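In case it helps with the "how do arrays and keys work" part, here is the same logic as the one-liner above, spread out with comments. This is only a sketch: the two input files are recreated in a scratch directory with shortened, made-up sequence lines, and 13920 is already written as 13290 as assumed above.

```shell
#!/bin/sh
# Scratch copies of the inputs (hypothetical, shortened sample data).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' '>01234' 'NNNNAAAA' '>17321' 'SSSSKKKK' '>13290' 'ZZZZYYYY' > 1.txt
printf '%s\n' 'hbcd/efgh/z-01234/2000' 'hbcd/efgh/zw-11000/2000' 'hbcd/efgh/t-13290/2000' > 2.txt

awk '
# NR==FNR is true only while the first file argument (2.txt) is being read.
NR == FNR {
    split($0, a, "/")            # a[3] is e.g. "z-01234"
    sub(/^[^-]+-/, "", a[3])     # drop everything up to the first "-"
    map[a[3]] = $0               # key: the ID (01234), value: the whole line
    next                         # do not run the rules below for 2.txt
}
# Second file (1.txt): if the text after the first character is a key
# in map, rebuild the line as ">" followed by the stored 2.txt line.
(k = substr($0, 2)) in map { $0 = ">" map[k] }
# A bare 1 is an always-true pattern, so every line of 1.txt is printed
# in its original order, changed or not.
1
' 2.txt 1.txt > 3.txt

cat 3.txt
```

The important difference from the attempts in the question is that nothing is printed from an END block with for (i in a), whose iteration order is unspecified. Lines of 1.txt are printed as they are read, so the order is preserved and unmatched headers pass through untouched.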
Here are some alternate solutions:
# with GNU awk
awk 'NR==FNR{match($0, /-([0-9]+)/, a); map[a[1]]=$0; next}
(k=substr($0, 2)) in map{$0 = ">" map[k]} 1' 2.txt 1.txt
# assuming '/' and '-' will always be similar to given sample
awk -F'[/-]' 'NR==FNR{map[$4]=">"$0; next}
$2 in map{$0 = map[$2]} 1' 2.txt FS='>' 1.txt
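The last command relies on awk handling a var=value operand at the point where it appears in the argument list: -F'[/-]' is in effect while 2.txt is read, and FS='>' takes over just before 1.txt is opened. A small sketch to show that switch, using one hypothetical line per file and printing the field count and the field used as the key:

```shell
#!/bin/sh
# One-line scratch inputs (hypothetical sample data).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' 'hbcd/efgh/z-01234/2000' > 2.txt
printf '%s\n' '>01234' > 1.txt

out=$(awk -F'[/-]' '
    # 2.txt is split on "/" or "-": hbcd, efgh, z, 01234, 2000
    NR == FNR { print "2.txt: NF=" NF " key=" $4; next }
    # 1.txt is split on ">": an empty first field, then the ID
              { print "1.txt: NF=" NF " key=" $2 }
' 2.txt FS='>' 1.txt)

printf '%s\n' "$out"
```

On this input it should print "2.txt: NF=5 key=01234" followed by "1.txt: NF=2 key=01234", so the same ID ends up in $4 for one file and in $2 for the other.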