awk match three columns from two files and append matching lines to a new file-CodePudding

There are many posts that are similar to this one. Hours into trouble shooting this I'm desperate, as it seems like it should be simple.

I have one file that looks like this:

tig00000005 15310   16162   XP_012153921.1  NW_003797090.1  LOC105664333    PREDICTED: elastin-like
tig00000005 23339   23974   XP_012152584.1  NW_003797083.1  LOC100878991    PREDICTED: LOW QUALITY PROTEIN
tig00000005 24600   25138   XP_012143166.1  NW_003797196.1  LOC100881279    PREDICTED: ankyrin-2 isoform X2
tig00000005 2685    4511    XP_012144644.1  NW_003797249.1  LOC105662970    PREDICTED: fibrinogen alpha chain-like isoform X2
tig00000005 28923   29432   XP_012148395.1  NW_003797444.1  LOC100881617    PREDICTED: eukaryotic translation initiation factor 4 gamma 3-like isoform X12
tig00000005 32415   34324   XP_012153921.1  NW_003797090.1  LOC105664333    PREDICTED: elastin-like

And a second file that looks like this:

tig00000005 maker   gene    15310   16162   .       .   ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 maker   gene    16764   17237   .       .   ID=snap_masked-tig00000005-processed-gene-0.3;Name=snap_masked-tig00000005-processed-gene-0.3
tig00000005 maker   gene    23339   23974   .       .   ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 maker   gene    24600   25138   .   -   .   ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
tig00000005 maker   gene    25472   26900   .       .   ID=snap_masked-tig00000005-processed-gene-0.5;Name=snap_masked-tig00000005-processed-gene-0.5

I would like to match the 1, 2, and 3 column in the first file with the 1, 4, and 5 in the second, and if they match, append the second file's data to the first file, like so:

tig00000005 15310   16162   XP_012153921.1  NW_003797090.1  LOC105664333    PREDICTED: elastin-like tig00000005 maker   gene    15310   16162   .       .   ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2

Some example code that has not worked:

awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print $0,a[$0]}'  file 1 file 2

awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=($1,$4,$5)} {print $0,a[$0]}' file 1 file 2

First outputs a file with every line from file 1 followed by (not appended) file 2, second code throws errors related to the = function. I've tried any permutation of this I can imagine. Thank you for any help you can provide

CodePudding user response：

A couple small changes to OP's first awk script:

# old:

awk 'OFS="\t"; FS="\t"; NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print $0,a[$0]}' file1 file2

# new - add BEGIN block, modify print statement:

awk 'BEGIN {FS=OFS="\t"} NR==FNR{a[$1,$2,$3]=$0; next} (($1,$4,$5) in a){print a[$1,$4,$5],$0}' file1 file2

The modified awk script generates:

tig00000005 15310   16162   XP_012153921.1  NW_003797090.1  LOC105664333    PREDICTED: elastin-like tig00000005 maker   gene    15310   16162   .       .   ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 23339   23974   XP_012152584.1  NW_003797083.1  LOC100878991    PREDICTED: LOW QUALITY PROTEIN tig00000005 maker   gene    23339   23974   .       .   ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 24600   25138   XP_012143166.1  NW_003797196.1  LOC100881279    PREDICTED: ankyrin-2 isoform X2 tig00000005 maker   gene    24600   25138   .   -   .   ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10

CodePudding user response：

Like this?

awk 'NR==FNR{a[$1" "$2" "$3]=$0; next}; {if($1" "$4" "$5 in a){print a[$1" "$4" "$5],$0}}' file1 file2
tig00000005 15310   16162   XP_012153921.1  NW_003797090.1  LOC105664333    PREDICTED: elastin-like tig00000005 maker   gene    15310   16162   .       .   ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 23339   23974   XP_012152584.1  NW_003797083.1  LOC100878991    PREDICTED: LOW QUALITY PROTEIN tig00000005 maker   gene    23339   23974   .       .   ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 24600   25138   XP_012143166.1  NW_003797196.1  LOC100881279    PREDICTED: ankyrin-2 isoform X2 tig00000005 maker   gene    24600   25138   .   -   .   ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10

To write to a new file just do awk 'NR==FNR{a[$1" "$2" "$3]=$0; next}; {if($1" "$4" "$5 in a){print a[$1" "$4" "$5],$0}}' file1 file2 > file3