I have a large vcf.gz file (40GB) that I have to split so I can load each subset into R and run a script on it. I want to split it by the first column, which worked with
zcat large_data.vcf.gz | cut -f1,2-5,8- | awk '{ print | ("gzip -c > " $1".vcf.gz") }'
But I would like to have the header saved in each subset. The header is not carried over into the split files (which is what I thought the command would do). It might be because the header starts with a #:
#col1 col2 col3 col4 col5 col6 col7 col8
I tried
zcat large_data1.vcf.gz | cut -f1,2-5,8- |
awk 'NR == 1{header = $0; next}
!($1 in filename){ print header | ("gzip -c > " $1 ".vcf.gz") }
NR > 1 { print $0 | ("gzip -c > " $1 ".vcf.gz"); filename[$1] }' file
But something is wrong somewhere...
Any idea?
PS: --filter is not a recognized option.
Edit: an example of the data
#col1 col2 col3 col4 col5 col6 col7 col8
1 100 100 100 1000 110 100 110
1 110 100 110 500 200 150 160
2 140 120 100 1000 110 160 210
2 110 180 170 700 220 150 120
Desired data for 1:
#col1 col2 col3 col4 col5 col6 col7 col8
1 100 100 100 1000 110 100 110
1 110 100 110 500 200 150 160
and for 2:
#col1 col2 col3 col4 col5 col6 col7 col8
2 140 120 100 1000 110 160 210
2 110 180 170 700 220 150 120
CodePudding user response:
You might have to save the header and the filenames in the awk program:
zcat large_data.vcf.gz |
cut -f1,2-5,8- |
awk '
    BEGIN { getline header }                         # grab the header line before the main loop
    {
        filename = $1 ".vcf.gz"
        if (!seen[$1]) {
            print header | ("gzip -c > " filename)   # write the header once per subset
            seen[$1] = 1
        }
        print | ("gzip -c > " filename)
    }
'
Remark: why getline? Because testing NR == 1 and NR > 1 against every record of a 40GB file is unnecessarily slower.
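A quick way to check the result, using the filenames 1.vcf.gz and 2.vcf.gz that the sample data above would produce (adjust for your real chromosome names):

for f in 1.vcf.gz 2.vcf.gz; do
    echo "== $f =="
    zcat "$f" | head -3
done

Each subset should start with the #col1 ... header line, followed by its own rows only.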
CodePudding user response:
I just made it work with
zcat large_data.vcf.gz |
cut -f1,2-5,8- |
awk 'NR == 1 { header = $0; next }
!($1 in filename) { print header | "gzip > " $1 ".vcf.gz"; filename[$1] = 1 }
NR > 1 { print $0 | "gzip > " $1 ".vcf.gz" }'
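One caveat, as a sketch only: both versions keep one gzip pipe open per distinct value in column 1 until awk exits. With a handful of chromosomes that is fine, but if the first column has many distinct values you can run into the per-process open-file limit. Assuming the input is already grouped/sorted by the first column (as VCFs normally are) and that there is a single header line as in the example, closing each pipe when the key changes keeps only one pipe open at a time:

zcat large_data.vcf.gz |
cut -f1,2-5,8- |
awk '
    NR == 1 { header = $0; next }        # single header line, as in the example
    $1 != prev {
        if (cmd != "") close(cmd)        # finish the previous subset before starting a new one
        cmd = "gzip -c > " $1 ".vcf.gz"
        print header | cmd               # each subset starts with the header
        prev = $1
    }
    { print | cmd }
'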