I am trying to iterate over each row of a Polars DataFrame in Rust. So far I have found df.get, but the documentation says that it is slow. I then tried df.column("col").get, but this seems to pose similar problems.
What is the correct way to process each row of the DataFrame? I need to turn each row into a struct and upload it to a database.
Answer:
If you activate the rows feature in Polars, you can try DataFrame::get_row and DataFrame::get_row_amortized. The latter is preferred, as it reduces heap allocations by reusing the row buffer.
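To make the buffer-reuse point concrete, here is a minimal sketch that uses only the standard library. The Columns type and its row and row_into methods are hypothetical stand-ins, not the Polars API; they only illustrate why filling a caller-owned buffer avoids one heap allocation per row.

```rust
/// Hypothetical stand-in for a columnar table: one Vec per column, all of equal length.
struct Columns {
    data: Vec<Vec<f64>>,
}

impl Columns {
    /// Allocates a fresh Vec for every requested row.
    fn row(&self, idx: usize) -> Vec<f64> {
        self.data.iter().map(|col| col[idx]).collect()
    }

    /// Overwrites a caller-owned buffer, so its allocation is reused across calls.
    fn row_into(&self, idx: usize, buf: &mut Vec<f64>) {
        buf.clear();
        buf.extend(self.data.iter().map(|col| col[idx]));
    }
}

/// Sums every value by visiting the rows through a single reused buffer.
fn sum_rows_reusing_buffer(cols: &Columns) -> f64 {
    let height = cols.data.first().map_or(0, |c| c.len());
    let mut buf = Vec::with_capacity(cols.data.len());
    let mut total = 0.0;
    for idx in 0..height {
        cols.row_into(idx, &mut buf);
        total += buf.iter().sum::<f64>();
    }
    total
}
```

After the first call the buffer already has enough capacity, so later rows are written into the same allocation. Each value is still gathered from a different column, which is the cost the next section describes.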
Anti-pattern
Fetching rows this way will be slow. Asking for rows from columnar data storage incurs many cache misses and goes through several layers of indirection.
Slightly better
What would be slightly better is using Rust iterators. This has less indirection than the get_row methods.
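// Rechunk first so each column is backed by one contiguous chunk.
// This takes `&mut self`, so `df` must be declared `mut`.
df.as_single_chunk_par();
// One iterator per selected column.
let mut iters = df.columns(["foo", "bar", "ham"])?
    .iter().map(|s| s.iter()).collect::<Vec<_>>();
for _row in 0..df.height() {
    // Advance every column iterator once to visit the current row.
    for iter in &mut iters {
        let value = iter.next().expect("should have as many iterations as rows");
        // process value
    }
}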
If your DataFrame consists of a single data type, you should downcast each Series to a ChunkedArray; this will speed up iteration. In the snippet below, we'll assume the data type is Float64.
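// `f64()` returns an error if a column is not Float64, hence the Result.
let mut iters = df.columns(["foo", "bar", "ham"])?
    .iter().map(|s| Ok(s.f64()?.into_iter())).collect::<Result<Vec<_>>>()?;
for _row in 0..df.height() {
    // Advance every typed iterator once to visit the current row.
    for iter in &mut iters {
        let value = iter.next().expect("should have as many iterations as rows");
        // process value
    }
}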
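Since the goal is to turn rows into structs, the same lockstep iteration can feed a struct constructor. The sketch below uses only the standard library: plain iterators over Option<f64> stand in for the typed column iterators, which yield None for null values. The Record struct, the to_records function, and the policy of skipping rows that contain a null are assumptions made for illustration.

```rust
/// Hypothetical target struct; the field names mirror the column names used above.
#[derive(Debug, PartialEq)]
struct Record {
    foo: f64,
    bar: f64,
    ham: f64,
}

/// Walks three column iterators in lockstep and builds one `Record` per row.
/// Rows with a null (`None`) in any column are skipped; that policy is an assumption.
fn to_records<I>(foo: I, bar: I, ham: I) -> Vec<Record>
where
    I: Iterator<Item = Option<f64>>,
{
    foo.zip(bar)
        .zip(ham)
        .filter_map(|((foo, bar), ham)| {
            // `?` returns None from the closure as soon as one value is null.
            Some(Record { foo: foo?, bar: bar?, ham: ham? })
        })
        .collect()
}
```

With Polars, the three arguments would be the iterators obtained from the downcast columns, provided they all share one type. For columns of mixed types, a struct with one field per type and a separate generic parameter per column would be needed.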