How to rename chromosome_position column in a Beagle file and match it with the index fai?-CodePudding

I have text files (tab separated) which have different columns. I need to rename my chromosome_position column (see MM.beagle.gz file below) since a program that I use don't allow multiple underscores in the chromosome name (causing a parsing issue because NC_044592.1_3795 is not working as a name).

My indexed genome looks like this:

head my.fna.fai

Contains this:

NC_044571.1     115307910       88      80      81
NC_044572.1     151975198       116749435       80      81
NC_044573.1     113180500       270624411       80      81
NC_044574.1     71869398        385219756       80      81

The bealgle file looks like this:

zcat MM.beagle.gz | head | cut -f 1-3

Which gives:

marker  allele1 allele2
NC_044571.1_3795        G       T
NC_044573.1_3796        G       T
NC_044572.1_3801        T       C
NC_044574.1_3802        G       A

In R I can get the chromosome and position:

beag = read.table("MM.beagle.gz", header = TRUE)
chr=gsub("_\\d $", "", beag$marker)
pos=gsub("^[A-Z]*_[0-9]*.[0-9]_", "", beag$marker)

But I'm not able to rename the beagle file in-place. I'd like to rename all contigs in the .fai file from 1:nrow(my.fna.fai) and match it to the beagle file.

So in the end the .fai should look like:

head my.fna.fai

Desired output:

1     115307910       88      80      81
2     151975198       116749435       80      81
3     113180500       270624411       80      81
4     71869398        385219756       80      81

And the beagle file:

zcat MM.beagle.gz | head | cut -f 1-3

Would give:

marker  allele1 allele2
1_3795        G       T
3_3796        G       T
2_3801        T       C
4_3802        G       A

where 22_3795 is the concatenation of the contig 22 and the position 3795, separated with an _.

The solution would preferentially be in bash as R is not practical due to the large file size of my final compressed beagle file (>210GB)

Someone proposed to change the .fai with this:

awk 'BEGIN{OFS="\t"}{print NR, $2, $3, $4, $5}' my.fna.fai

What I'm not able to figure out now is to make sure that the .fai and the .beagle file are consistent with each other. For example, event if the first column (marker) of the .beagle file is shuffled, it should be possible to match it with the .fai file and rename the chromosome names in the .beagle file. For example, if NC_1234.1 is renamed to 142 in the .fai, then all NC_1234.1_XXX in the .beagle should become 142_XXX, where XXX are numbers.

Here is an attempt at the solution:

awk 'BEGIN{OFS="\t"}{print $1, NR}' my.fna.fai > my.fna.fai.nr

awk -F'\t' -v OFS='\t' '{split($1,a,"_"); print $0,a[1]"_"a[2],a[3]}' MM.beagle.txt | awk 'NR!=1 {print}' | awk 'BEGIN{OFS="\t"}{print $0, NR}'> file2.sep.txt

sort file2.sep.txt > file2.1.s.txt

join -1 4 -2 1 file2.1.s.txt  my.fna.fai.nr | sort -k6 -n   | awk 'BEGIN{OFS="\t"}{$1=$2=$6="";print $7"_"$5,$0}' | awk 'BEGIN{OFS="\t"}{$4=$5="";print $0}' > file4.txt
echo $(awk 'NR==1 {print}' MM.beagle.txt); cat file4.txt

Gives

marker allele1 allele2
1_3795  G       T
3_3796  G       T
2_3801  T       C
4_3802  G       A

CodePudding user response：

To ensure the new FASTA index and modified Beagle files are consistent, we can use the FASTA index and an associative array to store the chromosome name with it's line number. This lets us then parse the Beagle file and use the chromosome name to retrieve the line number from the array. Here's one way using awk:

Contents of rename_chroms.awk:

BEGIN {
    FS=OFS="\t"
}

FNR==NR {
    arr[$1]=NR
    next
}

FNR==1 {
    print
    next
}

{
    n = split($1, a, "_")

    chrom = substr($1, 0, length($1) - (length(a[n])   1))
    pos = a[n]

    print arr[chrom] "_" pos, $2, $3
}

Run using:

awk -f rename_chroms.awk my.fna.fai <(zcat MM.beagle.gz)

Results:

marker  allele1 allele2
1_3795  G   T
2_3796  G   T
3_3801  T   C
4_3802  G   A