Take cumsum of each row in polars-CodePudding

E.g. if I have

import polars as pl
df = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

how would I find the cumulative sum of each row?

Expected output:

Here's the equivalent in pandas:

>>> import pandas as pd
>>> pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]}).cumsum(axis=1)
    a   b
0   1   5
1   2   7
2   3   9

but I can't figure out how to do it in polars

CodePudding user response：

Polars is column-oriented, and as such does not have the concept of a axis. Still, we can use the list evaluation context to solve this.

First, let's expand you data slightly:

df = pl.DataFrame({
    "id": ['a', 'b', 'c'],
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
})
df

shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id  ┆ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a   ┆ 1   ┆ 4   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 2   ┆ 5   ┆ 8   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c   ┆ 3   ┆ 6   ┆ 9   │
└─────┴─────┴─────┴─────┘

The Algorithm

Here's a general-purpose performant algorithm that will solve this. We'll walk through it below.

my_cols = [s.name for s in df if s.is_numeric()]
(
    df
    .with_column(
        pl.concat_list(my_cols)
        .arr.eval(pl.element().cumsum())
        .arr.to_struct(name_generator=lambda idx: my_cols[idx])
        .alias('result')
    )
    .drop(my_cols)
    .unnest('result')
)

shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id  ┆ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a   ┆ 1   ┆ 5   ┆ 12  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 2   ┆ 7   ┆ 15  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c   ┆ 3   ┆ 9   ┆ 18  │
└─────┴─────┴─────┴─────┘

How it works

First, we'll select the names of the numeric columns. You can name these explicitly if you like, e.g., my_cols=['a','b','c'].

Next, we'll gather up the column values into a list using polars.concat_list.

my_cols = [s.name for s in df if s.is_numeric()]
(
    df
    .with_column(
        pl.concat_list(my_cols)
        .alias('result')
    )
)

shape: (3, 5)
┌─────┬─────┬─────┬─────┬───────────┐
│ id  ┆ a   ┆ b   ┆ c   ┆ result    │
│ --- ┆ --- ┆ --- ┆ --- ┆ ---       │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ list[i64] │
╞═════╪═════╪═════╪═════╪═══════════╡
│ a   ┆ 1   ┆ 4   ┆ 7   ┆ [1, 4, 7] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ 2   ┆ 5   ┆ 8   ┆ [2, 5, 8] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ c   ┆ 3   ┆ 6   ┆ 9   ┆ [3, 6, 9] │
└─────┴─────┴─────┴─────┴───────────┘

From here, we'll use the arr.eval context to run our cumsum on the list.

my_cols = [s.name for s in df if s.is_numeric()]
(
    df
    .with_column(
        pl.concat_list(my_cols)
        .arr.eval(pl.element().cumsum())
        .alias('result')
    )
)

shape: (3, 5)
┌─────┬─────┬─────┬─────┬────────────┐
│ id  ┆ a   ┆ b   ┆ c   ┆ result     │
│ --- ┆ --- ┆ --- ┆ --- ┆ ---        │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ list[i64]  │
╞═════╪═════╪═════╪═════╪════════════╡
│ a   ┆ 1   ┆ 4   ┆ 7   ┆ [1, 5, 12] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ 2   ┆ 5   ┆ 8   ┆ [2, 7, 15] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c   ┆ 3   ┆ 6   ┆ 9   ┆ [3, 9, 18] │
└─────┴─────┴─────┴─────┴────────────┘

Next, we'll break the list into a struct using arr.to_struct, and name the fields the corresponding names from our selected numeric columns.

my_cols = [s.name for s in df if s.is_numeric()]
(
    df
    .with_column(
        pl.concat_list(my_cols)
        .arr.eval(pl.element().cumsum())
        .arr.to_struct(name_generator=lambda idx: my_cols[idx])
        .alias('result')
    )
)

shape: (3, 5)
┌─────┬─────┬─────┬─────┬───────────┐
│ id  ┆ a   ┆ b   ┆ c   ┆ result    │
│ --- ┆ --- ┆ --- ┆ --- ┆ ---       │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ struct[3] │
╞═════╪═════╪═════╪═════╪═══════════╡
│ a   ┆ 1   ┆ 4   ┆ 7   ┆ {1,5,12}  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ 2   ┆ 5   ┆ 8   ┆ {2,7,15}  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ c   ┆ 3   ┆ 6   ┆ 9   ┆ {3,9,18}  │
└─────┴─────┴─────┴─────┴───────────┘

And finally, we'll use unnest to break the struct into columns. (But first we must drop the original columns or else we'll get two columns with the same name.)

my_cols = [s.name for s in df if s.is_numeric()]
(
    df
    .with_column(
        pl.concat_list(my_cols)
        .arr.eval(pl.element().cumsum())
        .arr.to_struct(name_generator=lambda idx: my_cols[idx])
        .alias('result')
    )
    .drop(my_cols)
    .unnest('result')
)

shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id  ┆ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a   ┆ 1   ┆ 5   ┆ 12  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 2   ┆ 7   ┆ 15  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c   ┆ 3   ┆ 9   ┆ 18  │
└─────┴─────┴─────┴─────┘

CodePudding user response：

There may be a simpler and faster way, but here is the programmatic solution.

Concatenate the values along the columns into a list
Calculate the cumulative sum over the list (the result is still a list)
Get values for each column in the result

import polars as pl

df = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

df.select([
    pl.concat_list(pl.all())
    .arr.eval(pl.element().cumsum())
    .alias('cs')
]).select([
    pl.col('cs').arr.get(i).alias(name)
    for i, name in enumerate(df.columns)
])

shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 9   │
└─────┴─────┘