I have a gene file with the list of genes:
Gene
ABC
ASF
AGH
And a file that contains different statistics:
Gene total_coverage average_coverage S12_total_cvg S12_mean_cvg S12_granular_Q1 S12_granular_median S12_granular_Q3 S12_%_above_10 S12_%_above_20 S12_%_above_50
ABC 68264 74.36 68264 74.36 51 68 92 100.0 99.8 76.0
ASF 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
AGH 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
ORD 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
NOC 559425 248.63 559425 248.63 174 246 316 100.0 100.0 100.0
When I try to get the information for these genes, the first gene in alphabetical order ends up on top, like this:
ABC
Gene total_coverage average_coverage S12_%_above_20 S12_%_above_50
ASF 653934 232.14 100.0 100.0
AGH 653934 232.14 100.0 100.0
Why is it like this? This is my code:
rule join:
    input:
        stat="{sample}.stats",
        genes="gene_list.txt"
    output:
        "{sample}.stats.output"
    shell:
        """
        join --header -1 1 -2 1 -t $'\t' <(awk 'NR==1; NR > 1 {{print $0 | "sort -k1,1"}}' {input.stat}) <(sort -k1,1 {input.genes}) | cut -d$'\t' -f1-3,10-11 >> {output}
        """
CodePudding user response:
According to join --help:
Usage: join [OPTION]... FILE1 FILE2
...
When FILE1 or FILE2 (not both) is -, read standard input.
...
-j FIELD equivalent to '-1 FIELD -2 FIELD'
...
We can try something like this:
$ cat a
Gene
ABC
ASF
AGH
$ cat b
Gene total_coverage average_coverage S12_total_cvg S12_mean_cvg S12_granular_Q1 S12_granular_median S12_granular_Q3 S12_%_above_10 S12_%_above_20 S12_%_above_50
ABC 68264 74.36 68264 74.36 51 68 92 100.0 99.8 76.0
ASF 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
AGH 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
ORD 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
NOC 559425 248.63 559425 248.63 174 246 316 100.0 100.0 100.0
$ join --header -j 1 a - < <(awk '{print $1, $2, $3}' b)
Gene total_coverage average_coverage
ABC 68264 74.36
ASF 653934 232.14
AGH 653934 232.14
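Note that join expects both inputs to be sorted on the join field, and --header only protects the first line of each file. In the rule from the question, `sort -k1,1 {input.genes}` sorts the `Gene` header in with the gene names, so `ABC` becomes the first line of that input and is taken as its header, which is probably why it shows up on top. A minimal sketch that keeps each header in place and sorts only the lines below it, assuming bash, GNU coreutils, and tab-separated copies of the files a and b from above (the helper name sort_below_header is made up for this example):

```bash
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # make sort and join agree on the collation order

workdir=$(mktemp -d)
cd "$workdir"

# Sample inputs from the question, converted to tab-separated files.
printf '%s\n' Gene ABC ASF AGH > a
tr ' ' '\t' > b <<'EOF'
Gene total_coverage average_coverage S12_total_cvg S12_mean_cvg S12_granular_Q1 S12_granular_median S12_granular_Q3 S12_%_above_10 S12_%_above_20 S12_%_above_50
ABC 68264 74.36 68264 74.36 51 68 92 100.0 99.8 76.0
ASF 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
AGH 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
ORD 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
NOC 559425 248.63 559425 248.63 174 246 316 100.0 100.0 100.0
EOF

# Hypothetical helper: print the header line untouched,
# then the remaining lines sorted on the first field.
sort_below_header() {
    head -n 1 "$1"
    tail -n +2 "$1" | sort -t $'\t' -k1,1
}

# Join on field 1, then keep the gene, columns 2-3 and the last two columns.
join --header -t $'\t' -1 1 -2 1 \
    <(sort_below_header a) \
    <(sort_below_header b) |
    cut -f1-3,10-11 > joined.tsv

cat joined.tsv
```

The header stays on the first line and the matching rows follow in sorted order (ABC, AGH, ASF). Writing with > instead of >> also avoids stacking results from earlier runs in the same output file.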
Or just awk:
$ awk '/Gene|ABC|ASF|AGH/{print $1, $2, $3}' b
Gene total_coverage average_coverage
ABC 68264 74.36
ASF 653934 232.14
AGH 653934 232.14
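The pattern above hardcodes the gene names and matches them anywhere on the line, so a name that contains ABC would also get through. A rough sketch that reads the names from the gene file instead and compares the first field exactly, assuming whitespace-separated files; the `ABC2` row is invented here only to show the difference:

```bash
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' Gene ABC ASF AGH > a
cat > b <<'EOF'
Gene total_coverage average_coverage
ABC 68264 74.36
ABC2 1 1.00
ASF 653934 232.14
AGH 653934 232.14
ORD 653934 232.14
EOF

# First file (a): remember every wanted name, the "Gene" header included.
# Second file (b): print only rows whose first field is exactly one of them.
awk 'NR==FNR {want[$1]; next} $1 in want {print $1, $2, $3}' a b > filtered.txt

cat filtered.txt
```

The rows keep the order they have in b, and the made-up ABC2 line is dropped because the lookup is on the whole first field rather than on a regex.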
With tabs:
$ awk '/Gene|ABC|ASF|AGH/{printf "%s\t%-10s\t%s\n", $1, $2, $3}' b
Gene total_coverage average_coverage
ABC 68264 74.36
ASF 653934 232.14
AGH 653934 232.14
CodePudding user response:
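Using awk, load the wanted columns of the stats file into an array keyed by gene, then print them in the order of the gene list: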
$ awk 'NR==FNR {a[$1]=$2FS$3FS$(NF-1)FS$NF; next} {print $0,a[$1]}' stats.txt gene.list
Gene total_coverage average_coverage S12_%_above_20 S12_%_above_50
ABC 68264 74.36 99.8 76.0
ASF 653934 232.14 100.0 100.0
AGH 653934 232.14 100.0 100.0
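Since the question's files are tab-separated, a variant of the same idea with tab input and output may be closer to what is needed. This is only a sketch, assuming bash and awk; it also skips names that have no row in the stats file (the `XYZ` entry is made up to show that):

```bash
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Tab-separated stats file, built from the sample data in the question.
tr ' ' '\t' > stats.txt <<'EOF'
Gene total_coverage average_coverage S12_total_cvg S12_mean_cvg S12_granular_Q1 S12_granular_median S12_granular_Q3 S12_%_above_10 S12_%_above_20 S12_%_above_50
ABC 68264 74.36 68264 74.36 51 68 92 100.0 99.8 76.0
ASF 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
AGH 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
ORD 653934 232.14 653934 232.14 178 216 265 100.0 100.0 100.0
EOF
printf '%s\n' Gene ABC ASF AGH XYZ > gene.list   # XYZ is hypothetical

awk 'BEGIN {FS = OFS = "\t"}
     # Pass 1 (stats.txt): store columns 2, 3 and the last two, keyed by gene.
     NR==FNR {row[$1] = $2 OFS $3 OFS $(NF-1) OFS $NF; next}
     # Pass 2 (gene.list): print only names that were seen in the stats file.
     $1 in row {print $1, row[$1]}' stats.txt gene.list > picked.tsv

cat picked.tsv
```

The output follows the order of gene.list, so no sorting is needed, at the cost of holding the selected stats columns in memory.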