I have a polars dataframe containing two columns where both columns are lists.
df = pl.DataFrame({
'a': [[True, False], [False, True]],
'b': [['name1', 'name2'], ['name3', 'name4']]
})
df
shape: (2, 2)
┌───────────────┬────────────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ list[bool] ┆ list[str] │
╞═══════════════╪════════════════════╡
│ [true, false] ┆ ["name1", "name2"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [false, true] ┆ ["name3", "name4"] │
└───────────────┴────────────────────┘
I want to filter column b using column a as a boolean mask. The length of each list in column a is always the same as the length of the corresponding list in column b.
I can think of using an explode, then filtering, aggregating, and performing a join, but in some cases a join column is not available, and I would rather avoid this method for simplicity.
Are there any other ways to filter a list using another list as a boolean mask? I have tried using .arr.eval, but it does not seem to accept operations involving other columns.
Any help would be appreciated!
CodePudding user response:
This is not the most ideal solution: we first add a row count so that every list gets its own group, explode each list to its elements, then group by that row count again and apply the filter.
df = pl.DataFrame({
'a': [[True, False], [False, True]],
'b': [['name1', 'name2'], ['name3', 'name4']]
})
(df.with_row_count()
.explode(["a", "b"])
.groupby("row_nr")
.agg([
pl.col("b").filter(pl.col("a"))
])
)
shape: (2, 2)
┌────────┬───────────┐
│ row_nr ┆ b │
│ --- ┆ --- │
│ u32 ┆ list[str] │
╞════════╪═══════════╡
│ 1 ┆ ["name4"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ ["name1"] │
└────────┴───────────┘
Maybe we can come up with something better in polars. It would be nice if arr.eval could access other columns. TBC!
Edit 02-06-2022
In polars-0.13.41 this is not as expensive as you might think. Polars knows that the row_count is sorted and maintains that sortedness throughout the whole query. The explodes are also essentially free for the list columns.
When polars knows a groupby key is sorted, the groupby operation will be ~15x faster.
In the query above you would only pay for:
- exploding the row count
- grouping the sorted key (which is super fast)
- traversing the list (which is something we would need to pay anyway).
To verify that it runs fast, you can run the query with POLARS_VERBOSE=1. This will write the following text to stderr:
could fast explode column a
could fast explode column b
keys/aggregates are not partitionable: running default HASH AGGREGATION
groupby keys are sorted; running sorted key fast path