Fastest way to generate a kmer count vector from a nucleotide sequence (Julia)-CodePudding

Given a nucleotide sequence, I'm writing some Julia code to generate a sparse vector of (masked) kmer counts, and I would like it to run as fast as possible.

Here is my current implementation,

using Distributions
using SparseArrays

function kmer_profile(seq, k, mask)
    basis = [4^i for i in (k - 1):-1:0]

    d = Dict('A'=>0, 'C'=>1, 'G'=>2, 'T'=>3)

    kmer_dict = Dict{Int, Int32}(4^k=>0)

    for n in 1:(length(seq) - length(mask)   1)
        kmer_hash = 1
        j = 1
        for i in 1:length(mask)
            if mask[i]
                kmer_hash  = d[seq[n i-1]] * basis[j]
                j  = 1
            end
        end
        haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash]  = 1 : kmer_dict[kmer_hash] = 1
    end

    return sparsevec(kmer_dict)
end

seq = join(sample(['A','C','G','T'], 1000000))

mask_str = "111111011111001111111111111110"

mask = BitArray([parse(Bool, string(m)) for m in split(mask_str, "")])
k = sum(mask)

@time kmer_profile(seq, k, mask)

This code runs in about 0.3 seconds on my M1 MacBook Pro, is there any way to make it run significantly faster?

The function kmer_profile uses a sliding window of size length(mask) to count the number of times each masked kmer appears in the nucleotide sequence. A mask is a binary sequence, and a masked kmer is a kmer with nucleotides dropped at positions at which the mask is zero. E.g. the kmer ACGT and mask 1001 will produce the masked kmer AT.

To produce the kmer hash, the function treats each kmer as a base 4 number and then converts it to a (base 10) 64-bit integer, for indexing into the kmer vector.

The size of k is equal to the number of ones in the mask string, and is implicitly limited to 31 so that kmer hashes can fit into a 64-bit integer type.

CodePudding user response：

There are several possible optimizations to make this code faster.

First of all, one can convert the Dict to an array since array-based indexing is faster than dictionary-based indexing one and this is possible here since the key is an ASCII character.

Moreover, the extraction of the sequence codes can be done once instead of length(mask) times by pre-computing code and putting the result in a temporary array.

Additionally, the mask-based conditional and the loop carried dependency make things slow. Indeed, the condition cannot be (easily) predicted by the processor causing it to stall for several cycles. The loop carried dependency make things even worse since the processor can hardly execute other instructions during this stall. This problem can be solved by pre-computing the factors based on both mask and basis. The result is a faster branch-less loop.

Once the above optimizations are done, the biggest bottleneck is sparsevec. In fact, it was also taking nearly half the time of the initial implementation! Optimizing this step is difficult but not impossible. It is slow because of random accesses in the Julia implementation. One can speed this up by sorting the keys-values pairs in the first place. It is faster due to a more cache-friendly execution and it can also help the prediction unit of the processor. This is a complex topic. For more details about how this works, please read Why is processing a sorted array faster than processing an unsorted array?.

Here is the final optimized code:

function kmer_profile_opt(seq, k, mask)
    basis = [4^i for i in (k - 1):-1:0]

    d = zeros(Int8, 128)
    d[Int64('A')] = 0
    d[Int64('C')] = 1
    d[Int64('G')] = 2
    d[Int64('T')] = 3

    seq_codes = [d[Int8(e)] for e in seq]

    j = 1
    premult = zeros(Int64, length(mask))
    for i in 1:length(mask)
        if mask[i]
            premult[i] = basis[j]
            j  = 1
        end
    end

    kmer_dict = Dict{Int, Int32}(4^k=>0)

    for n in 1:(length(seq) - length(mask)   1)
        kmer_hash = 1
        j = 1
        for i in 1:length(mask)
            kmer_hash  = seq_codes[n i-1] * premult[i]
        end
        haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash]  = 1 : kmer_dict[kmer_hash] = 1
    end

    sorted_kmer_pairs = sort(collect(kmer_dict))
    sorted_kmer_keys = [e[1] for e in sorted_kmer_pairs]
    sorted_kmer_values = [e[2] for e in sorted_kmer_pairs]
    return sparsevec(sorted_kmer_keys, sorted_kmer_values)
end

This code is a bit more than twice faster than the initial implementation on my machine. A significant fraction of the time is still spent in the sorting algorithm.

The code can still be optimized further. One way is to use a parallel sort algorithm. Another way is to replace the premult[i] multiplication by a shift which is faster assuming premult[i] is modified so to contain exponents. I expect the code to be about 4 times faster than the original code. The main bottleneck should be the big dictionary creation. Improving further the performance of this is very hard (though it is still possible).

CodePudding user response：

Inspired by Jérôme's answer, and squeezing some more by avoiding Dicts altogether:

function kmer_profile_opt3(seq, k, mask)
    d = zeros(Int8, 128)
    d[Int64('A')] = 0
    d[Int64('C')] = 1
    d[Int64('G')] = 2
    d[Int64('T')] = 3
    seq_codes = [d[Int8(e)] for e in seq]
           
    basis = [4^i for i in (k-1):-1:0]
    j = 1
    premult = zeros(Int64, length(mask))
    for i in 1:length(mask)
        if mask[i]
            premult[i] = basis[j]
            j  = 1
        end
    end

    kmer_vec = Vector{Int}(undef, length(seq)-length(mask) 1)
    @inbounds for n in 1:(length(seq) - length(mask)   1)
        kmer_hash = 1
        for i in 1:length(mask)
            kmer_hash  = seq_codes[n i-1] * premult[i]
        end
        kmer_vec[n] = kmer_hash
    end
    sort!(kmer_vec)
    return sparsevec(kmer_vec, ones(length(kmer_vec)), 4^k,  )
end

This achieved another 2x on my machine.

sparsevec auto-combining of similar index values makes the code compact.