While joining two files the first gene gets lost

I have a gene file with the list of genes:

Gene
ABC
ASF
AGH

And a file that contains different statistics:

Gene    total_coverage  average_coverage    S12_total_cvg   S12_mean_cvg    S12_granular_Q1 S12_granular_median S12_granular_Q3 S12_%_above_10  S12_%_above_20  S12_%_above_50
ABC 68264   74.36   68264   74.36   51  68  92  100.0   99.8    76.0
ASF 653934  232.14  653934  232.14  178 216 265 100.0   100.0   100.0
AGH 653934  232.14  653934  232.14  178 216 265 100.0   100.0   100.0
ORD 653934  232.14  653934  232.14  178 216 265 100.0   100.0   100.0
NOC 559425  248.63  559425  248.63  174 246 316 100.0   100.0   100.0

When I try to extract the statistics for these genes, the first gene in alphabetical order ends up above the header, like this:

ABC
Gene    total_coverage  average_coverage    S12_%_above_20  S12_%_above_50
ASF 653934  232.14  100.0   100.0
AGH 653934  232.14  100.0   100.0

Why does this happen? This is my Snakemake rule:

rule join:
    input:
        stat="{sample}.stats",
        genes="gene_list.txt"
    output: 
        "{sample}.stats.output"
    shell:
        """
        join  --header -1 1 -2 1 -t $'\t' <(awk 'NR==1; NR > 1 {{print $0 | "sort -k1,1"}}' {input.stat}) <(sort -k1,1 {input.genes}) | cut -d$'\t' -f1-3,10-11 >> {output}
        """

CodePudding user response:

According to join --help:

Usage: join [OPTION]... FILE1 FILE2
...
When FILE1 or FILE2 (not both) is -, read standard input.
...
  -j FIELD          equivalent to '-1 FIELD -2 FIELD'
...

We can try something like this:

$ cat a
Gene
ABC
ASF
AGH

$ cat b
Gene    total_coverage  average_coverage    S12_total_cvg   S12_mean_cvg    S12_granular_Q1 S12_granular_median S12_granular_Q3 S12_%_above_10  S12_%_above_20  S12_%_above_50
ABC 68264   74.36   68264   74.36   51  68  92  100.0   99.8    76.0
ASF 653934  232.14  653934  232.14  178 216 265 100.0   100.0   100.0
AGH 653934  232.14  653934  232.14  178 216 265 100.0   100.0   100.0
ORD 653934  232.14  653934  232.14  178 216 265 100.0   100.0   100.0
NOC 559425  248.63  559425  248.63  174 246 316 100.0   100.0   100.0

$ join  --header -j 1 a - < <(awk '{print $1, $2, $3}' b)
Gene total_coverage average_coverage
ABC 68264 74.36
ASF 653934 232.14
AGH 653934 232.14
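
One likely cause of the misplaced line in the question's rule is that `sort -k1,1 {input.genes}` sorts the `Gene` header together with the gene names, so `ABC` becomes the first line of that input and `join --header` treats it as a header. The sketch below is one possible way around that, not a drop-in replacement for the rule: it keeps the first line of each file in place and sorts only the remaining lines before joining. The helper name `sort_body`, the temporary file names, and the shortened three-column sample data are made up for illustration, and `--header` is assumed to be available (it is a GNU coreutils option).

```shell
set -eu
tmp=$(mktemp -d)
tab=$(printf '\t')

# Sample inputs modelled on the question (stats reduced to three columns).
printf 'Gene\nABC\nASF\nAGH\n' > "$tmp/genes.txt"
printf 'Gene\ttotal_coverage\taverage_coverage\nABC\t68264\t74.36\nASF\t653934\t232.14\nAGH\t653934\t232.14\nORD\t653934\t232.14\n' > "$tmp/stats.tsv"

# Hypothetical helper: print the header line untouched,
# then the remaining lines sorted on the first tab-separated field.
sort_body() {
    head -n 1 "$1"
    tail -n +2 "$1" | LC_ALL=C sort -t "$tab" -k1,1
}

sort_body "$tmp/genes.txt" > "$tmp/genes.sorted"
sort_body "$tmp/stats.tsv" > "$tmp/stats.sorted"

# Use the same locale for join as for sort so both agree on the ordering.
joined=$(LC_ALL=C join --header -t "$tab" -j 1 "$tmp/genes.sorted" "$tmp/stats.sorted")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

With both headers kept on the first line, the joined output starts with the header row and the matched genes follow in sorted order.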

Or just awk:

$ awk '/Gene|ABC|ASF|AGH/{print $1, $2, $3}' b
Gene total_coverage average_coverage
ABC 68264 74.36
ASF 653934 232.14
AGH 653934 232.14

With tabs:

$ awk '/Gene|ABC|ASF|AGH/{printf "%s\t%-10s\t%s\n", $1, $2, $3}' b
Gene    total_coverage  average_coverage
ABC     68264           74.36
ASF     653934          232.14
AGH     653934          232.14
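
Note that a pattern such as `/Gene|ABC|ASF|AGH/` is tested against the whole line, so it would also select a row whose name merely contains one of those strings. A minimal sketch that anchors the match to the first column instead; the `ABCD` row is a made-up example added to show the difference, and the data is reduced to three tab-separated columns:

```shell
# Sample data; the ABCD row is hypothetical and should NOT be selected.
stats=$(printf 'Gene\ttotal_coverage\taverage_coverage\nABC\t68264\t74.36\nABCD\t1\t1.00\nASF\t653934\t232.14\nAGH\t653934\t232.14\nORD\t653934\t232.14\n')

# Compare only field 1, and require the whole field to match one of the names.
filtered=$(printf '%s\n' "$stats" |
    awk -F '\t' -v OFS='\t' '$1 ~ /^(Gene|ABC|ASF|AGH)$/ {print $1, $2, $3}')
printf '%s\n' "$filtered"
```

Here the header and the `ABC`, `ASF` and `AGH` rows are kept, while `ABCD` and `ORD` are dropped.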

CodePudding user response:

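Using awk: load the wanted columns from the stats file into an array keyed by gene name, then print the stored columns for each line of the gene list: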

$ awk 'NR==FNR {a[$1]=$2FS$3FS$(NF-1)FS$NF; next} {print $0,a[$1]}' stats.txt gene.list
Gene total_coverage average_coverage S12_%_above_20 S12_%_above_50
ABC 68264 74.36 99.8 76.0
ASF 653934 232.14 100.0 100.0
AGH 653934 232.14 100.0 100.0
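
If the files are tab-separated and the output should stay tab-separated (as the `cut -d$'\t'` in the question suggests), the same idea could be sketched as follows. This is an illustrative variant under that assumption, not the answerer's exact command; the temporary file names are made up, and genes that are absent from the stats file are skipped rather than printed with empty columns.

```shell
set -eu
tmp=$(mktemp -d)

# Sample rows copied from the question; tr turns the single spaces into tabs.
tr ' ' '\t' > "$tmp/stats.tsv" <<'EOF'
Gene total_coverage average_coverage S12_total_cvg S12_mean_cvg S12_granular_Q1 S12_granular_median S12_granular_Q3 S12_%_above_10 S12_%_above_20 S12_%_above_50
ABC 68264 74.36 68264 74.36 51 68 92 100.0 99.8 76.0
ASF 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
AGH 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
ORD 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
EOF
printf 'Gene\nABC\nASF\nAGH\n' > "$tmp/genes.txt"

picked=$(awk -F '\t' -v OFS='\t' '
    # First file (stats): remember columns 2, 3 and the last two, keyed by gene.
    NR == FNR { row[$1] = $2 OFS $3 OFS $(NF-1) OFS $NF; next }
    # Second file (gene list): print only genes that were seen in the stats.
    $1 in row { print $1, row[$1] }
' "$tmp/stats.tsv" "$tmp/genes.txt")
printf '%s\n' "$picked"
rm -rf "$tmp"
```

Because the gene list is read second, the output follows the order of the gene list and no sorting is needed; the header row is handled like any other key since both files start with `Gene`.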