For instance, I want to accomplish a conversion like the one below.
┌────────────┐
│ col │
│ --- │
│ list[str] │
╞════════════╡
│ ["a"] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["a", "b"] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["c"] │
└────────────┘
↓
↓
↓
┌────────────┬─────────────────┐
│ col ┆ col_cum │
│ --- ┆ --- │
│ list[str] ┆ list[str] │
╞════════════╪═════════════════╡
│ ["a"] ┆ ["a"] │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["a", "b"] ┆ ["a", "b"] │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["c"] ┆ ["a", "b", "c"] │
└────────────┴─────────────────┘
I've tried polars.Expr.cumulative_eval(), but it doesn't work.
From the official API example, I can access the first and last elements in every iteration, but what I want here is the result of the previous iteration, I think.
Could I get some help?
CodePudding user response:
import pandas as pd
df = pd.DataFrame([[['a']], [['a', 'b']], [['c']]], columns=['list'])
df['list_cum'] = df.list.cumsum().apply(set).apply(sorted)
print(df)
prints
| index | list | list_cum |
| --- | --- | --- |
| 0 | a | a |
| 1 | a,b | a,b |
| 2 | c | a,b,c |
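Why this works: pandas' cumsum on a column of Python lists falls back to repeated + on the objects, and + concatenates lists, so each row ends up with the concatenation of all lists so far; set then deduplicates and sorted restores a consistent order. A minimal plain-Python illustration (not part of the answer's code) of what the third row goes through:
row3 = ['a'] + ['a', 'b'] + ['c']  # cumsum concatenates the lists -> ['a', 'a', 'b', 'c']
print(sorted(set(row3)))           # set deduplicates, sorted orders -> ['a', 'b', 'c']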
CodePudding user response:
We can use the cumulative_eval expression.
But first, let's expand your data so that we can include some other things that may be of interest:
- We'll include a group variable to show how the algorithm can be used with grouping variables.
- We'll also include an empty list to show how the algorithm handles these.
import polars as pl
df = pl.DataFrame(
{
"group": [1, 1, 1, 2, 2, 2, 2],
"var": [["a"], ["a", "b"], ["c"], ["p"], ["q", "p"], [], ["s"]],
}
)
df
shape: (7, 2)
┌───────┬────────────┐
│ group ┆ var │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═══════╪════════════╡
│ 1 ┆ ["a"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["a", "b"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["c"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["p"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["q", "p"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["s"] │
└───────┴────────────┘
The Algorithm
Here's the heart of the algorithm:
(
df
.with_column(
pl.col('var')
.cumulative_eval(
pl.element()
.explode()
.unique()
.sort()
.list()
)
.over('group')
.alias('cumulative')
)
)
shape: (7, 3)
┌───────┬────────────┬──────────────────────┐
│ group ┆ var ┆ cumulative │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[str] ┆ list[str] │
╞═══════╪════════════╪══════════════════════╡
│ 1 ┆ ["a"] ┆ ["a"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["a", "b"] ┆ ["a", "b"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["c"] ┆ ["a", "b", "c"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["p"] ┆ ["p"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["q", "p"] ┆ ["p", "q"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [] ┆ [null, "p", "q"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["s"] ┆ [null, "p", ... "s"] │
└───────┴────────────┴──────────────────────┘
Notice the null introduced by the empty list. If you want, you can eliminate this with a filtering step.
(
df
.with_column(
pl.col('var')
.cumulative_eval(
pl.element()
.explode()
.unique()
.sort()
.list()
)
.over('group')
.alias('cumulative')
)
.with_column(
pl.col('cumulative').arr.eval(
pl.element().filter(pl.element().is_not_null())
)
)
)
shape: (7, 3)
┌───────┬────────────┬─────────────────┐
│ group ┆ var ┆ cumulative │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[str] ┆ list[str] │
╞═══════╪════════════╪═════════════════╡
│ 1 ┆ ["a"] ┆ ["a"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["a", "b"] ┆ ["a", "b"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["c"] ┆ ["a", "b", "c"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["p"] ┆ ["p"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["q", "p"] ┆ ["p", "q"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [] ┆ ["p", "q"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["s"] ┆ ["p", "q", "s"] │
└───────┴────────────┴─────────────────┘
How it works
cumulative_eval allows us to treat a subset of rows for a column as if it were a Series itself (with the exception that we access the elements of the underlying Series using polars.element).
So, let's simulate what the cumulative_eval expression is doing by working with the Series itself directly. We'll simulate what the algorithm does when cumulative_eval reaches the last row where group == 1 (the third row).
The first major step of the algorithm is to explode the lists. explode will put each element of each list on its own row:
(
df
.select(
pl.col('var')
.filter(pl.col('group') == 1)
.explode()
)
)
shape: (4, 1)
┌─────┐
│ var │
│ --- │
│ str │
╞═════╡
│ a │
├╌╌╌╌╌┤
│ a │
├╌╌╌╌╌┤
│ b │
├╌╌╌╌╌┤
│ c │
└─────┘
In the next step, we will use unique and sort to eliminate duplicates and keep the order consistent.
(
df
.select(
pl.col('var')
.filter(pl.col('group') == 1)
.explode()
.unique()
.sort()
)
)
shape: (3, 1)
┌─────┐
│ var │
│ --- │
│ str │
╞═════╡
│ a │
├╌╌╌╌╌┤
│ b │
├╌╌╌╌╌┤
│ c │
└─────┘
At this point, we need only to roll up all the values into a list.
(
df
.select(
pl.col('var')
.filter(pl.col('group') == 1)
.explode()
.unique()
.sort()
.list()
)
)
shape: (1, 1)
┌─────────────────┐
│ var │
│ --- │
│ list[str] │
╞═════════════════╡
│ ["a", "b", "c"] │
└─────────────────┘
And that is the value that cumulative_eval returns for the third row.
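If you want to check the intermediate rows as well, here is a small sketch (not part of the original answer) that repeats the same simulation for each prefix of group 1. It assumes the same Polars version as the code above (where .list() rolls values up into a list) and uses Expr.head to mimic the growing window that cumulative_eval sees:
# Repeat the simulation for each prefix of group 1; head(n) limits the
# expression to the first n rows, which is what cumulative_eval sees at row n.
for n in [1, 2, 3]:
    print(
        df.select(
            pl.col('var')
            .filter(pl.col('group') == 1)
            .head(n)
            .explode()
            .unique()
            .sort()
            .list()
        )
    )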
Performance
The documentation for cumulative_eval comes with a strong warning about performance.
Warning: This can be really slow as it can have O(n^2) complexity. Don’t use this for operations that visit all elements.
Let's simulate some data. The code below generates about 9.5 million records, 10,000 groups, so that there are about 950 observations per group.
import numpy as np
from string import ascii_lowercase
rng = np.random.default_rng(1)
nbr_rows = 10_000_000
df = (
pl.DataFrame({
'group': rng.integers(1, 10_000, size=nbr_rows),
'sub_list': rng.integers(1, 10_000, size=nbr_rows),
'var': rng.choice(list(ascii_lowercase), nbr_rows)
})
.groupby(['group', 'sub_list'])
.agg(
pl.col('var')
)
.drop('sub_list')
.sort('group')
)
df
shape: (9515737, 2)
┌───────┬────────────┐
│ group ┆ var │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═══════╪════════════╡
│ 1 ┆ ["q", "r"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["z"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["b"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["j"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9999 ┆ ["z"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9999 ┆ ["e"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9999 ┆ ["s"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9999 ┆ ["s"] │
└───────┴────────────┘
On my 32-core system, here's the wall-clock time:
import time
start = time.perf_counter()
(
df
.with_column(
pl.col('var')
.cumulative_eval(
pl.element()
.explode()
.unique()
.sort()
.list()
)
.over('group')
.alias('cumulative')
)
.with_column(
pl.col('cumulative').arr.eval(
pl.element().filter(pl.element().is_not_null())
)
)
)
print(time.perf_counter() - start)
shape: (9515737, 3)
┌───────┬────────────┬─────────────────────┐
│ group ┆ var ┆ cumulative │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[str] ┆ list[str] │
╞═══════╪════════════╪═════════════════════╡
│ 1 ┆ ["q", "r"] ┆ ["q", "r"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["z"] ┆ ["q", "r", "z"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["b"] ┆ ["b", "q", ... "z"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["j"] ┆ ["b", "j", ... "z"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9999 ┆ ["z"] ┆ ["a", "b", ... "z"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9999 ┆ ["e"] ┆ ["a", "b", ... "z"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9999 ┆ ["s"] ┆ ["a", "b", ... "z"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9999 ┆ ["s"] ┆ ["a", "b", ... "z"] │
└───────┴────────────┴─────────────────────┘
>>> print(time.perf_counter() - start)
118.46121257600134
Roughly 2 minutes on my system for 9.5 million records. Depending on your system, you may get better or worse performance. But the point is that it didn't take hours to complete.
If you need better performance, we can come up with a better-performing algorithm (or perhaps put in a feature request for a cumlist feature in Polars, which might have complexity better than the O(n^2) complexity of cumulative_eval).
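For reference, one better-performing direction is to carry a running set of seen values per group in plain Python, which is linear in the total number of list elements. This is only a sketch under the assumptions that the frame fits comfortably in memory and that a Python-level loop is acceptable; the 'seen' dict and 'cumulative' list are names introduced here for illustration:
# Linear-time sketch: maintain a running set of seen values per group
# (keyed by a dict, so groups don't need to be contiguous), then attach
# the accumulated lists as a new column.
seen = {}
cumulative = []
for group, var in zip(df['group'], df['var']):
    s = seen.setdefault(group, set())
    s.update(var)                 # add this row's list elements
    cumulative.append(sorted(s))  # snapshot the running set as a sorted list
result = df.with_column(pl.Series('cumulative', cumulative))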