I'm trying to join two files which are already sorted
File1
70 CBLB Cbl proto-oncogene B
70 HOXC11 centrosomal protein 57
70 CHD4 chromodomain helicase
70 FANCF FA complementation
70 LUZP2 leucine zipper protein 2
File2
0.700140820757797 ELAVL1
0.700229616476825 HOXC11
0.700328646327188 CHD4
0.700328951649384 LUZP2
Output
Gene Symbol Gene Description Target Score mirDB Target Score Diana
HOXC11 centrosomal protein 57 70 0.700229616476825
CHD4 chromodomain helicase 70 0.700328646327188
LUZP2 leucine zipper protein 2 70 0.700328951649384
To perform this task, I have tried with this script, but it returns an empty file
join -j 2 -o 1.1,1.2,1.3,1.4,2.4 File1 File2 | column -t | sed '1i Gene Symbol, Gene
Description, Target Score mirDB, Target Score Diana' > Output
Any help with awk or join commands requested.
CodePudding user response:
You can try this awk
$ awk 'BEGIN {OFS="\t"; print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"} NR==FNR{array[$2]=$1; next} $0!~array[$2]{print $2,OFS $3" "$4" "$5,$6, $1,OFS array[$2]}' file2 file1
Gene Symbol Gene Description Target Score mirDB Target Score Diana
HOX11 centrosomal protein 57 70 0.700229616476825
CHD4 chromodomain helicase 70 0.700328646327188
LUZP2 leucine zipper protein 2 70 0.700328951649384
BEGIN {
OFS="\t"
print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
} NR==FNR {
array[$2]=$1
next
} $0!~array[$2] {
print $2,OFS $3" "$4" "$5,$6, $1,OFS array[$2]
}
CodePudding user response:
UPDATE: updated awk
to remove Windows line endings (\r
) as this popped up as an issue during comments/other-question with OP
Issues:
- OP's current code needs to pre-sort both files before calling
join
- due to the variable number of white space delimited columns in
File1
it's going to be hard (impossible?) to getjoin
to generate a format that won't be scrambled by the follow-oncolumn
call column
cannot distinguish between spaces used as field delimiters vs spaces as part of a field
Due to these issues I think an awk
solution, combined with column
for an 'easy' reformat, is easier to implement and understand, eg:
awk '
BEGIN { OFS="|" # "|" will be used as the input delimiter for a follow-on "column" call
print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
}
{ sub(/\r/,"") } # remove Windows line ending "\r" for all lines in all files
FNR==NR { gene[$2]=$1 ; next }
$2 in gene { lastF=pfx=""
for (i=3;i<=NF;i ) { # pull fields #3 to #NF into a single variable
lastF=lastF pfx $i
pfx=" "
}
print $2, lastF, $1, gene[$2]
}
' File2 File1
This generates:
Gene Symbol|Gene Description|Target Score mirDB|Target Score Diana
HOXC11|centrosomal protein 57|70|0.700229616476825
CHD4|chromodomain helicase|70|0.700328646327188
LUZP2|leucine zipper protein 2|70|0.700328951649384
While it is possible to add some more code so that awk
prints the output in 'pretty' columns, I've opted for a simpler approach of letting column
do the extra work:
awk '
BEGIN { OFS="|"
print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
}
{ sub(/\r/,"") } # remove Windows line ending "\r" for all lines in all files
FNR==NR { gene[$2]=$1 ; next }
$2 in gene { lastF=pfx=""
for (i=3;i<=NF;i ) {
lastF=lastF pfx $i
pfx=" "
}
print $2, lastF, $1, gene[$2]
}
' File2 File1 | column -s'|' -t
This generates:
Gene Symbol Gene Description Target Score mirDB Target Score Diana
HOXC11 centrosomal protein 57 70 0.700229616476825
CHD4 chromodomain helicase 70 0.700328646327188
LUZP2 leucine zipper protein 2 70 0.700328951649384
CodePudding user response:
This might work for you (GNU sed, join and column):
( echo 'Gene Symbol@Gene Description@Target Score mirDB@Target Score Diana';
join -j2 -t@ --no -o 0,1.3,1.1,2.1 <(sed 's/ /@/;s//@/' file1) <(sed 's/ /@/' file2) ) |
column -s@ -t
Formulate the final headings, join the two input files and pass the total output to the column command that tabulates the result.
N.B. The headings are delimited by the @
an arbitrary character not found in the headings or joined files. The input files are amended so that their field delimiters match those of the headings and the column command uses the same delimiter to tabulate the final result. The --no
(short for --nocheck-order
) prevents warning messages.