Extracting certain locus from multiple samples from text file-CodePudding

After profiling 800,000 STR locus in over 100 samples, the output gave me 122 files (1 sample per file). My file look like this:

SAMPLE  CHROM   POS Allele_1    Allele_2    LENGTH
HG02526 chr15 17019727 (ata)4 (ata)4 3
HG02526 chr15 17035572 (tta)4 (tta)4 3
HG02526 chr15 17043558 (ata)4 (ata)4 3
HG02526 chr15 19822808 (ttta)3 (ttta)3 4
HG02526 chr15 19844660 (taca)3 (taca)3 4

I've been trying to extract variants for each locus with this command line:

awk '$1!="SAMPLE" {print $0 > $2"_"$3".locus.tsv"}' *.vcf

It worked but it take lots of time to create large number of files for each locus.

I am struggling to find an optimal solution to solve this?

CodePudding user response：

You aren't closing the output files as you go so if you have a large number of them then your script will either slow down significantly trying to manage them all (e.g. with gawk) or fail saying "too many output files" (with most other awks).

Assuming your input is sorted by fields 2 and 3 (if it's not, then sort it), you should be using the following with any awk:

awk '
    FNR==1 { next }
    { cur = $2 "_" $3 ".locus.tsv" }
    cur != out { close(out); out=cur }
    { print > out }
' *.vcf

If you want to have the header line present in every output file then tweak that to:

awk '
    FNR==1 { if (NR==1) hdr=$0; next }
    { cur = $2 "_" $3 ".locus.tsv" }
    cur != out { close(out); out=cur; print hdr > out }
    { print > out }
' *.vcf

Both of the above assume that the same $2/$3 pair can't exist in multiple input files. If they can then edit your question to show at least 2 input files that cover that case along with the expected output.

CodePudding user response：

My VCF file look like this:

SAMPLE  CHROM   POS Allele_1    Allele_2    LENGTH
HG02526 chr15 17019727 (ata)4 (ata)4 3
HG02526 chr15 17035572 (tta)4 (tta)4 3
HG02526 chr15 17043558 (ata)4 (ata)4 3
HG02526 chr15 19822808 (ttta)3 (ttta)3 4
HG02526 chr15 19844660 (taca)3 (taca)3 4

this is NOT a vcf file
for such file, sort on chrom,pos, compress with bgzip and index with tabix and query with tabix. http://www.htslib.org/doc/tabix.html

CodePudding user response：

You can try processing everything in memory before printing them.

FNR > 1 {
    i = $2 "_" $3
    b[i,   a[i]] = $0
}
END {
    for (i in a) {
        n = i ".locus.tsv"
        for (j = 1; j <= a[i];   j)
            print b[i, j] > n
        close(n)
    }
}

This may work depending on the size of your files and the amount of memory your machine has. Using another language that allows having a dynamic array as value can also be more efficient.