Iterate over rows polars rust

Time:05-31

I am trying to iterate over each row of a Polars rust dataframe.

While looking for a way to do this, I found df.get, but the documentation says it is slow. I then tried df.column("col").get, but it seems to have the same problem.

What is the correct way to process each row of the dataframe? I need to turn the rows into structs and upload them to a database.

CodePudding user response:

If you activate the rows feature in polars, you can try:

DataFrame::get_row and DataFrame::get_row_amortized.

The latter is preferred, as that reduces heap allocations by reusing the row buffer.
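
To see why reusing a row buffer saves allocations, here is a minimal plain-Rust sketch of the two access patterns. It is not Polars code: Cell, Table and both methods are hypothetical stand-ins (Cell plays the role of a dynamically typed value), and only the method names are borrowed from the answer above.

```rust
/// A dynamically typed cell. Hypothetical stand-in for a dataframe value.
#[derive(Debug, Clone, PartialEq)]
pub enum Cell {
    Int(i64),
    Float(f64),
    Null,
}

/// A toy columnar table: one Vec per column, all of the same length.
pub struct Table {
    pub columns: Vec<Vec<Cell>>,
}

impl Table {
    /// Number of rows (0 if the table has no columns).
    pub fn height(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }

    /// Allocates a fresh Vec for the row on every call.
    pub fn get_row(&self, idx: usize) -> Vec<Cell> {
        self.columns.iter().map(|c| c[idx].clone()).collect()
    }

    /// Overwrites `buf` in place, so its allocation is reused across calls.
    pub fn get_row_amortized(&self, idx: usize, buf: &mut Vec<Cell>) {
        buf.clear(); // keeps the capacity, drops the old values
        buf.extend(self.columns.iter().map(|c| c[idx].clone()));
    }
}
```

In a loop over all rows, the first method allocates once per row, while the second allocates once in total as long as the same buffer is passed in each time. Both still touch every column for every row, which is where the cost discussed next comes from.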

Anti-pattern

This will be slow. Asking for rows from columnar data storage incurs many cache misses and goes through several layers of indirection.

Slightly better

Slightly better is to use Rust iterators, one per column, and advance them together. This has less indirection than the get_row methods.

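// `df` must be declared `mut`: rechunking happens in place.
// Merge each column into a single chunk before creating the iterators.
df.as_single_chunk_par();
let mut iters = df.columns(["foo", "bar", "ham"])?
    .iter().map(|s| s.iter()).collect::<Vec<_>>();

for _row in 0..df.height() {
    // advance every column iterator by one step to visit one row
    for iter in &mut iters {
        // each value is a dynamically typed AnyValue
        let value = iter.next().expect("should have as many iterations as rows");
        // process value
    }
}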

If every column in your DataFrame has the same data type, you should downcast each Series to a ChunkedArray. Iterating over typed values is faster than iterating over dynamically typed ones.

In the snippet below, we'll assume the data type is Float64.

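// `Result` is the Polars result alias (named PolarsResult in recent releases).
// s.f64() fails if the column is not Float64.
let mut iters = df.columns(["foo", "bar", "ham"])?
    .iter().map(|s| Ok(s.f64()?.into_iter())).collect::<Result<Vec<_>>>()?;

for _row in 0..df.height() {
    for iter in &mut iters {
        // each value is an Option<f64>; None marks a null
        let value = iter.next().expect("should have as many iterations as rows");
        // process value
    }
}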
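
Since the question asks for structs, the typed iterators can also be zipped so that each step yields one complete row. The sketch below uses only the standard library so that it runs on its own: Record and rows_from_columns are hypothetical names, and plain vectors of Option<f64> stand in for the columns. With Polars you would presumably pass in the iterators obtained from s.f64()?.into_iter() instead.

```rust
/// Hypothetical row struct for the three Float64 columns used in the answer.
#[derive(Debug, PartialEq)]
pub struct Record {
    pub foo: Option<f64>,
    pub bar: Option<f64>,
    pub ham: Option<f64>,
}

/// Advances three column iterators in lockstep and builds one struct per row.
/// `Option<f64>` models a nullable float column; `None` marks a null.
/// Iteration stops at the shortest column, so equal lengths are assumed.
pub fn rows_from_columns<I>(foo: I, bar: I, ham: I) -> Vec<Record>
where
    I: IntoIterator<Item = Option<f64>>,
{
    foo.into_iter()
        .zip(bar)
        .zip(ham)
        .map(|((foo, bar), ham)| Record { foo, bar, ham })
        .collect()
}
```

Collecting into a Vec is only for illustration. For a database upload it may be preferable to keep the zipped iterator lazy and send the records in batches.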