Compare two Pola-rs dataframes by position-CodePudding

Suppose I have two dataframes like:

let df_1 = df! {
        "1" => [1,   2,   2,   3,   4,   3],
        "2" => [1,   4,   2,   3,   4,   3],
        "3" => [1,   2,   6,   3,   4,   3],
    }
    .unwrap();

    let mut df_2 = df_1.clone();
    for idx in 0..df_2.width() {
        df_2.apply_at_idx(idx, |s| {
            s.cummax(false)
                .shift(1)
                .fill_null(FillNullStrategy::Zero)
                .unwrap()
        })
        .unwrap();
    }

    println!("{:#?}", df_1);
    println!("{:#?}", df_2);

shape: (6, 3)
┌─────┬─────┬─────┐
│ 1   ┆ 2   ┆ 3   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 2   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 3   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 4   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 3   ┆ 3   │
└─────┴─────┴─────┘
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1   ┆ 2   ┆ 3   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 0   ┆ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 4   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 4   ┆ 6   │
└─────┴─────┴─────┘

and I want to compare them such that I end up with a boolean dataframe I can use as a predicate for a selection and aggregation:

shape: (6, 3)
┌───────┬───────┬───────┐
│ 1     ┆ 2     ┆ 3     │
│ ---   ┆ ---   ┆ ---   │
│ bool  ┆ bool  ┆ bool  │
╞═══════╪═══════╪═══════╡
│ true  ┆ true  ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ true  ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ false ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ true  ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false ┆ false │
└───────┴───────┴───────┘

In Python Pandas I might do df.where(df_1.ge(df_2)).sum().sum(). What's the idiomatic way to do that with Rust Pola-rs?

CodePudding user response：

It took me the longest time to figure out how to even do elementwise addition in polars. I guess that's just not the "normal" way to use these things as in principle the columns can have different data types.

So. I asked ChatGPT. Wait. Don't ban me yet. ChatGPT messed it up and gave the wrong answer anyway. But it pointed me in the right direction. It wanted to just call zip and map on the dataframe directly. That doesn't work.

But. df has a method iter() that gives you an iterater over all the columns. The columns are Series, and for those you have all sorts of elementwise operations implemented.

Long story short

let df = df!("A" => &[1, 2, 3], "B" => &[4, 5, 6]).unwrap();
let df2 = df!("A" => &[6, 5, 4], "B" => &[3, 2, 1]).unwrap();

let df3 = DataFrame::new(
            df.iter()
              .zip(df2.iter())
              .map(|(series1, series2)| series1.gt(series2).unwrap())
              .collect());

That gives you your boolean array. From here, I assume it should be possible to figure out how to use that for indexing. Probably another use of df.iter().zip(df3) or whatever.