Home > Back-end >  Join two files by matching a specific column
Join two files by matching a specific column

Time:09-30

I'm trying to join two files which are already sorted

File1

70 CBLB Cbl proto-oncogene B
70 HOXC11 centrosomal protein 57
70 CHD4 chromodomain helicase
70 FANCF FA complementation
70 LUZP2 leucine zipper protein 2

File2

0.700140820757797 ELAVL1
0.700229616476825 HOXC11
0.700328646327188 CHD4
0.700328951649384 LUZP2

Output

Gene Symbol  Gene Description         Target Score mirDB   Target Score Diana
HOXC11       centrosomal protein 57   70                   0.700229616476825
CHD4         chromodomain helicase    70                   0.700328646327188
LUZP2        leucine zipper protein 2 70                   0.700328951649384

To perform this task, I have tried with this script, but it returns an empty file

join -j 2 -o 1.1,1.2,1.3,1.4,2.4 File1 File2 | column -t | sed '1i Gene Symbol, Gene 
Description, Target Score mirDB, Target Score Diana' > Output

Any help with awk or join commands requested.

CodePudding user response:

You can try this awk

$ awk 'BEGIN {OFS="\t"; print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"} NR==FNR{array[$2]=$1; next} $0!~array[$2]{print $2,OFS $3" "$4" "$5,$6, $1,OFS array[$2]}' file2 file1

Gene Symbol     Gene Description        Target Score mirDB      Target Score Diana
HOX11           centrosomal protein 57          70              0.700229616476825
CHD4            chromodomain helicase           70              0.700328646327188
LUZP2           leucine zipper protein  2       70              0.700328951649384
BEGIN {
    OFS="\t" 
    print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
} NR==FNR {
    array[$2]=$1
    next
} $0!~array[$2] {
    print $2,OFS $3" "$4" "$5,$6, $1,OFS array[$2]
}

CodePudding user response:

Issues:

  • OP's current code needs to pre-sort both files before calling join
  • due to the variable number of white space delimited columns in File1 it's going to be hard (impossible?) to get join to generate a format that won't be scrambled by the follow-on column call
  • column cannot distinguish between spaces used as field delimiters vs spaces as part of a field

Due to these issues I think an awk solution, combined with column for an 'easy' reformat, is easier to implement and understand, eg:

awk '
BEGIN      { OFS="|";                             # "|" will be used as the input delimiter for a follow-on "column" call
             print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
           }
FNR==NR    { gene[$2]=$1 ; next }
$2 in gene { lastF=pfx=""
             for (i=3;i<=NF;i  ) {                # pull fields #3 to #NF into a single variable 
                 lastF=lastF pfx $i
                 pfx=" "
             }
             print $2, lastF, $1, gene[$2]
           }
' File2 File1 

This generates:

Gene Symbol|Gene Description|Target Score mirDB|Target Score Diana
HOXC11|centrosomal protein 57|70|0.700229616476825
CHD4|chromodomain helicase|70|0.700328646327188
LUZP2|leucine zipper protein 2|70|0.700328951649384

While it is possible to add some more code so that awk prints the output in 'pretty' columns, I've opted for a simpler approach of letting column do the extra work:

awk '
BEGIN      { OFS="|";
             print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
           }
FNR==NR    { gene[$2]=$1 ; next }
$2 in gene { lastF=pfx=""
             for (i=3;i<=NF;i  ) {
                 lastF=lastF pfx $i
                 pfx=" "
             }
             print $2, lastF, $1, gene[$2]
           }
' File2 File1 | column -s'|' -t

This generates:

Gene Symbol  Gene Description          Target Score mirDB  Target Score Diana
HOXC11       centrosomal protein 57    70                  0.700229616476825
CHD4         chromodomain helicase     70                  0.700328646327188
LUZP2        leucine zipper protein 2  70                  0.700328951649384
  • Related