I have multiple WIG files in variableStep format that I am trying to merge. These files are used to store basepair - score pairs in a genomic context.
For example files A and B could look like that:
A
variableStep chrom=chr1
9967151 1.0
9967152 0.5
variableStep chrom=chr2
1107150 0.2
1107151 0.3
B
variableStep chrom=chr1
10001000 0.3
10001001 0.4
variableStep chrom=chr2
20002000 0.9
20002001 0.7
The final result I am looking for is this:
AxB
variableStep chrom=chr1
9967151 1.0
9967152 0.5
10001000 0.3
10001001 0.4
variableStep chrom=chr2
1107150 0.2
1107151 0.3
20002000 0.9
20002001 0.7
Do you know an efficient version of how I could merge two such files efficiently?
They could get rather large and I would like to avoid looping through it and keeping track of the header lines too much. I am looking for a bash command or something existent in R for example.
The size of a small subset of the large final file I currently have is about 800 MB but this will over time grow intensely to a range of tens to hundred GB.
It would be nice to have it numerically sorted per block but it is not crucial to be able to work with it.
The file will in a later step be transformed into a smaller binary file in the so called BigWIG format.
Background:
I want to build one large file to store information for all base pairs in the human genome with all chromosomes in one file. Since my scores that I store will be calculated in smaller individual files over time I want to update the final large file each time I get a new snippet like A or B. Ideally, these should even be unique, so that lines should be integrated only when they are not yet available, but for now the simple merge would be enough
What I tried so far:
I know how to use cat
to concatenate files but the problem here is that I will have more header lines than it should have
Also, I have a solution where I loop through each line, keep track of which chromosome I currently have and append the lines at the correct position. However, this takes rather long for a large dataset
Using rtracklayer::import in R, I can read the WIG files, append them, make them unique and export them again. This works, but it is very memory consuming to read the whole "database" file into memory each time I want to update
Thank you!
CodePudding user response:
Assumptions:
- no need to (re)sort data
- the 'small' file can fit into memory
- duplicate scores are maintained
Setup:
$ cat smallfile
variableStep chrom=chr1 # merge
10001000 0.3
10001001 0.4
variableStep chrom=chr2 # merge
1107150 0.2 # duplicate
20002000 0.9
20002001 0.7
variableStep chrom=chr78 # new
3341000 0.3
3526001 0.4
$ cat bigfile
variableStep chrom=chr1
9967151 1.0
9967152 0.5
variableStep chrom=chr117
2347150 0.2
2347151 0.3
variableStep chrom=chr2
1107150 0.2
1107151 0.3
variableStep chrom=chr36
4727150 0.2
4727151 0.3
One GNU awk
(for multi-dimensional arrays) idea:
awk '
### 1st file / smallfile
FNR==NR { if ($1 == "variableStep") { # if new header line then reset variables ...
chrom=$2
n=0
next
}
scores[chrom][ n]=$0 # store non-header lines in scores[] array
next
}
### 2nd file / bigfile
{ print # print current line
if ($1 == "variableStep") # if current line is a header line and ...
if ($2 in scores) { # the "chrom=chrX" is an index in the scores[] array then ...
for (n in scores[$2]) # loop through 2nd index and ...
print scores[$2][n] # print the array entry to stdout (ie, add to our output stream)
delete scores[$2] # delete data to keep from being added again in the "END" block
}
}
### add "new" entries (from 1st file / smallfile)
END { for (chrom in scores) {
print "variableStep", chrom
for (n in scores[chrom])
print scores[chrom][n]
}
}
' smallfile bigfile > newbigfile
This generates:
$ cat newbigfile
variableStep chrom=chr1 # merged
10001000 0.3 # added
10001001 0.4 # added
9967151 1.0
9967152 0.5
variableStep chrom=chr117
2347150 0.2
2347151 0.3
variableStep chrom=chr2 # merged
1107150 0.2 # added; duplicate
20002000 0.9 # added
20002001 0.7 # added
1107150 0.2 # duplicate
1107151 0.3
variableStep chrom=chr36
4727150 0.2
4727151 0.3
variableStep chrom=chr78 # added; new
3341000 0.3 # added
3526001 0.4 # added
NOTES:
- sorting could be added but will increase memory usage (likely not an issue if just sorting scores within a given header line)
- removing duplicate scores is doable with a negligble bump up in memory usage
- add "new" entries into
bigfile
in 'header line' sorted order is doable with a bit more coding