Home > front end >  How to split a vcf.gz file based on the first column, keeping the header in each subset and save bac
How to split a vcf.gz file based on the first column, keeping the header in each subset and save bac

Time:05-11

I have a large vcf.gz file (40GB) that I have to split to be able to load into R and run a script on each of the subset. I want to split it by the first column which worked with

zcat large_data.vcf.gz | cut -f1,2-5,8- | awk '{ print | ("gzip -c > " $1".vcf.gz") }'

But I would like to have the header saved in each subset. The header is not saved into a splitted data (what I thought it would do). It might be because the header starts with a #

#col1  col2  col3  col4  col5  col6  col7  col8

I tried

zcat large_data1.vcf.gz | cut -f1,2-5,8- | 
    awk 'NR == 1{header = $0; next} 
    !($1 in filename){ print header | (“gzip -c > “ $1 ".vcf.gz") } 
    NR > 1 { print $0 | (“gzip -c > “ $1 ".vcf.gz"); filename[$1] }' file

But something is wrong somewhere...

Any idea?
PS: -- filter is not a recognized option

Edit: an example of the data

#col1  col2  col3  col4  col5  col6  col7  col8
1  100  100  100  1000  110  100  110
1  110  100  110  500  200  150  160
2  140  120  100  1000  110  160  210
2  110  180  170  700  220  150  120

Desired data 1-

#col1  col2  col3  col4  col5  col6  col7  col8
1  100  100  100  1000  110  100  110
1  110  100  110  500  200  150  160

and 2-

#col1  col2  col3  col4  col5  col6  col7  col8
2  140  120  100  1000  110  160  210
2  110  180  170  700  220  150  120

CodePudding user response:

You might have to save the header and the filenames in the awk program:

zcat large_data.vcf.gz |
cut -f1,2-5,8- |
awk '
    BEGIN{ getline header }
    {
        filename = $1".vcf.gz"
        if (!seen[$1]) {
            print header | ("gzip -c > " filename)
            seen[$1]  
        }
        print | ("gzip -c > " filename)
    }
'

remark: why getline? because using NR==1 and NR>1 for a 40GB file is unnecessarily slower

CodePudding user response:

I just made it works with

zcat large_data.vcf.gz | 
cut -f1,2-5,8- | 
awk 'NR == 1{header = $0; next} 
!($1 in filename){ print header | "gzip > " $1 ".vcf.gz" } 
NR > 1 { print $0 | "gzip > " $1 ".vcf.gz" }'
  • Related