I have text files (tab separated) which have different columns. I need to rename my chromosome_position
column (see MM.beagle.gz file below) since a program that I use don't allow multiple underscores in the chromosome name (causing a parsing issue because NC_044592.1_3795
is not working as a name).
My indexed genome looks like this:
head my.fna.fai
Contains this:
NC_044571.1 115307910 88 80 81
NC_044572.1 151975198 116749435 80 81
NC_044573.1 113180500 270624411 80 81
NC_044574.1 71869398 385219756 80 81
The bealgle file looks like this:
zcat MM.beagle.gz | head | cut -f 1-3
Which gives:
marker allele1 allele2
NC_044571.1_3795 G T
NC_044573.1_3796 G T
NC_044572.1_3801 T C
NC_044574.1_3802 G A
In R I can get the chromosome and position:
beag = read.table("MM.beagle.gz", header = TRUE)
chr=gsub("_\\d $", "", beag$marker)
pos=gsub("^[A-Z]*_[0-9]*.[0-9]_", "", beag$marker)
But I'm not able to rename the beagle file in-place. I'd like to rename all contigs in the .fai
file from 1:nrow(my.fna.fai)
and match it to the beagle file.
So in the end the .fai
should look like:
head my.fna.fai
Desired output:
1 115307910 88 80 81
2 151975198 116749435 80 81
3 113180500 270624411 80 81
4 71869398 385219756 80 81
And the beagle file:
zcat MM.beagle.gz | head | cut -f 1-3
Would give:
marker allele1 allele2
1_3795 G T
3_3796 G T
2_3801 T C
4_3802 G A
where 22_3795
is the concatenation of the contig 22
and the position 3795
, separated with an _
.
The solution would preferentially be in bash as R is not practical due to the large file size of my final compressed beagle file (>210GB)
Someone proposed to change the .fai
with this:
awk 'BEGIN{OFS="\t"}{print NR, $2, $3, $4, $5}' my.fna.fai
What I'm not able to figure out now is to make sure that the .fai
and the .beagle
file are consistent with each other. For example, event if the first column (marker) of the .beagle
file is shuffled, it should be possible to match it with the .fai
file and rename the chromosome names in the .beagle
file. For example, if NC_1234.1
is renamed to 142
in the .fai
, then all NC_1234.1_XXX
in the .beagle
should become 142_XXX
, where XXX are numbers.
Here is an attempt at the solution:
awk 'BEGIN{OFS="\t"}{print $1, NR}' my.fna.fai > my.fna.fai.nr
awk -F'\t' -v OFS='\t' '{split($1,a,"_"); print $0,a[1]"_"a[2],a[3]}' MM.beagle.txt | awk 'NR!=1 {print}' | awk 'BEGIN{OFS="\t"}{print $0, NR}'> file2.sep.txt
sort file2.sep.txt > file2.1.s.txt
join -1 4 -2 1 file2.1.s.txt my.fna.fai.nr | sort -k6 -n | awk 'BEGIN{OFS="\t"}{$1=$2=$6="";print $7"_"$5,$0}' | awk 'BEGIN{OFS="\t"}{$4=$5="";print $0}' > file4.txt
echo $(awk 'NR==1 {print}' MM.beagle.txt); cat file4.txt
Gives
marker allele1 allele2
1_3795 G T
3_3796 G T
2_3801 T C
4_3802 G A
CodePudding user response:
To ensure the new FASTA index and modified Beagle files are consistent, we can use the FASTA index and an associative array to store the chromosome name with it's line number. This lets us then parse the Beagle file and use the chromosome name to retrieve the line number from the array. Here's one way using awk:
Contents of rename_chroms.awk
:
BEGIN {
FS=OFS="\t"
}
FNR==NR {
arr[$1]=NR
next
}
FNR==1 {
print
next
}
{
n = split($1, a, "_")
chrom = substr($1, 0, length($1) - (length(a[n]) 1))
pos = a[n]
print arr[chrom] "_" pos, $2, $3
}
Run using:
awk -f rename_chroms.awk my.fna.fai <(zcat MM.beagle.gz)
Results:
marker allele1 allele2
1_3795 G T
2_3796 G T
3_3801 T C
4_3802 G A