I am trying to speed up Python programs using Rust, a language in which I am a total beginner. I wrote a function that counts the occurrences of each possible string of length n
within a larger string. For instance, if the main string is "AAAAT"
and n=3
, the outcome would be a hashmap {"AAA":2,"AAT":1}
. I use pyo3 to call the Rust function from Python. The code of the Rust function is:
fn count_nmers(seq : &str,n :usize) -> PyResult<HashMap<&str,u64>> {
let mut current_pos : usize =0;
let mut counts : HashMap<&str,u64> =HashMap::new();
while current_pos n <= seq.len() {
//print!("{}\n", &seq[current_pos..current_pos n]);
match counts.get(&seq[current_pos..current_pos n]) {
Some(repeats) => counts.insert(&seq[current_pos..current_pos n],repeats 1),
None => counts.insert(&seq[current_pos..current_pos n],1)
};
current_pos =1;
}
//print!("{:?}",counts)
Ok(counts)
}
When I use small values for n
(n<10
), Rust is about an order of magnitude faster than Python, but as the length of n increases, the gap tends to zero with both functions having the same speed by n=200
. (see graph)
CodePudding user response:
You are computing hash function multiple times, this may matter for large n values. Try using entry function instead of manual inserts:
while current_pos n <= seq.len() {
let en = counts.entry(&seq[current_pos..current_pos n]).or_default();
*en = 1;
current_pos =1;
}
Next, make sure you are running --release
compiled code like cargo run --release
.
And one more thing to take in mind is discussed here, Rust may use non-optimal hash function for your case which you can change.
And finally, on large data, most of time is spent in HashMap/dict internals which are not a python, but compiled code. So don't expect it to scale well.
CodePudding user response:
Could it be because as n gets larger the number of iterations through the loop gets smaller?
Fewer iterations through the loop would reduce the performance gain seen by using Rust. I'm sure there is a small per function call performance cost for transition/marshaling to Rust from Python. This would explain how eventually the performance from pure Python and Python/Rust becomes the same.