Join two files by matching a specific column-CodePudding

I'm trying to join two files which are already sorted

File1

70 CBLB Cbl proto-oncogene B
70 HOXC11 centrosomal protein 57
70 CHD4 chromodomain helicase
70 FANCF FA complementation
70 LUZP2 leucine zipper protein 2

File2

0.700140820757797 ELAVL1
0.700229616476825 HOXC11
0.700328646327188 CHD4
0.700328951649384 LUZP2

Output

Gene Symbol  Gene Description         Target Score mirDB   Target Score Diana
HOXC11       centrosomal protein 57   70                   0.700229616476825
CHD4         chromodomain helicase    70                   0.700328646327188
LUZP2        leucine zipper protein 2 70                   0.700328951649384

To perform this task, I have tried with this script, but it returns an empty file

join -j 2 -o 1.1,1.2,1.3,1.4,2.4 File1 File2 | column -t | sed '1i Gene Symbol, Gene 
Description, Target Score mirDB, Target Score Diana' > Output

Any help with awk or join commands requested.

CodePudding user response：

You can try this awk

$ awk 'BEGIN {OFS="\t"; print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"} NR==FNR{array[$2]=$1; next} $0!~array[$2]{print $2,OFS $3" "$4" "$5,$6, $1,OFS array[$2]}' file2 file1

Gene Symbol     Gene Description        Target Score mirDB      Target Score Diana
HOX11           centrosomal protein 57          70              0.700229616476825
CHD4            chromodomain helicase           70              0.700328646327188
LUZP2           leucine zipper protein  2       70              0.700328951649384

BEGIN {
    OFS="\t" 
    print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
} NR==FNR {
    array[$2]=$1
    next
} $0!~array[$2] {
    print $2,OFS $3" "$4" "$5,$6, $1,OFS array[$2]
}

CodePudding user response：

UPDATE: updated awk to remove Windows line endings (\r) as this popped up as an issue during comments/other-question with OP

Issues:

OP's current code needs to pre-sort both files before calling join
due to the variable number of white space delimited columns in File1 it's going to be hard (impossible?) to get join to generate a format that won't be scrambled by the follow-on column call
column cannot distinguish between spaces used as field delimiters vs spaces as part of a field

Due to these issues I think an awk solution, combined with column for an 'easy' reformat, is easier to implement and understand, eg:

awk '
BEGIN      { OFS="|"                              # "|" will be used as the input delimiter for a follow-on "column" call
             print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
           }
           { sub(/\r/,"") }                       # remove Windows line ending "\r" for all lines in all files
FNR==NR    { gene[$2]=$1 ; next }
$2 in gene { lastF=pfx=""
             for (i=3;i<=NF;i  ) {                # pull fields #3 to #NF into a single variable 
                 lastF=lastF pfx $i
                 pfx=" "
             }
             print $2, lastF, $1, gene[$2]
           }
' File2 File1

This generates:

Gene Symbol|Gene Description|Target Score mirDB|Target Score Diana
HOXC11|centrosomal protein 57|70|0.700229616476825
CHD4|chromodomain helicase|70|0.700328646327188
LUZP2|leucine zipper protein 2|70|0.700328951649384

While it is possible to add some more code so that awk prints the output in 'pretty' columns, I've opted for a simpler approach of letting column do the extra work:

awk '
BEGIN      { OFS="|" 
             print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
           }
           { sub(/\r/,"") }                       # remove Windows line ending "\r" for all lines in all files
FNR==NR    { gene[$2]=$1 ; next }
$2 in gene { lastF=pfx=""
             for (i=3;i<=NF;i  ) {
                 lastF=lastF pfx $i
                 pfx=" "
             }
             print $2, lastF, $1, gene[$2]
           }
' File2 File1 | column -s'|' -t

This generates:

Gene Symbol  Gene Description          Target Score mirDB  Target Score Diana
HOXC11       centrosomal protein 57    70                  0.700229616476825
CHD4         chromodomain helicase     70                  0.700328646327188
LUZP2        leucine zipper protein 2  70                  0.700328951649384

CodePudding user response：

This might work for you (GNU sed, join and column):

( echo 'Gene Symbol@Gene Description@Target Score mirDB@Target Score Diana';
join -j2 -t@ --no -o 0,1.3,1.1,2.1 <(sed 's/ /@/;s//@/' file1) <(sed 's/ /@/' file2) ) |
column -s@ -t

Formulate the final headings, join the two input files and pass the total output to the column command that tabulates the result.

N.B. The headings are delimited by the @ an arbitrary character not found in the headings or joined files. The input files are amended so that their field delimiters match those of the headings and the column command uses the same delimiter to tabulate the final result. The --no (short for --nocheck-order) prevents warning messages.