How to Write Poisson CDF as Python Polars Expression-CodePudding

I have a collection of polars expressions being used to generate features for an ML model. I'd like to add a poission cdf feature to this collection whilst maintaining lazy execution (with benefits of speed, caching etc...). I so far have not found an easy way of achieving this.

I've been able to get the result I'd like outside of the desired lazy expression framework with:

import polars as pl
from scipy.stats import poisson

df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_column(pl.Series(result).alias("poission_cdf"))

However, in reality I'd like this to look like:

df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
df = df.select(
    [
        ... # bunch of other expressions here
        poisson_cdf()
    ]
)

where poisson_cdf is some polars expression like:

def poisson_cdf():
    # this is just for illustration, clearly wont work
    return scipy.stats.poisson.cdf(pl.col("count"), pl.col("expected_count")).alias("poisson_cdf")

I also tried using a struct made up of "count" and "expected_count" and apply like advised in the docs when applying custom functions. However, my dataset is several millions of rows in reality - leading to absurd execution time.

Any advice or guidance here would be appreciated. Ideally there exists an expression like this somewhere out there? Thanks in advance!

CodePudding user response：

If scipy.stats.poisson.cdf was implemented as a proper numpy universal function, it would be possible to use it directly on polars expressions, but it is not. Fortunately, Poisson CDF is almost the same as regularized upper incomplete gamma function for which scipy supplies gammaincc which can be used in polars expressions:

>>> import polars as pl
>>> from scipy.special import gammaincc
>>> df = pl.select(pl.arange(0, 10).alias('k'))
>>> df.with_columns(cdf=gammaincc(pl.col('k')   1, 4.0))
shape: (10, 2)
┌─────┬──────────┐
│ k   ┆ cdf      │
│ --- ┆ ---      │
│ i64 ┆ f64      │
╞═════╪══════════╡
│ 0   ┆ 0.018316 │
│ 1   ┆ 0.091578 │
│ 2   ┆ 0.238103 │
│ 3   ┆ 0.43347  │
│ ... ┆ ...      │
│ 6   ┆ 0.889326 │
│ 7   ┆ 0.948866 │
│ 8   ┆ 0.978637 │
│ 9   ┆ 0.991868 │
└─────┴──────────┘

The result is the same as returned by poisson.cdf:

>>> _.with_columns(cdf2=pl.lit(poisson.cdf(df['k'], 4)))
shape: (10, 3)
┌─────┬──────────┬──────────┐
│ k   ┆ cdf      ┆ cdf2     │
│ --- ┆ ---      ┆ ---      │
│ i64 ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╡
│ 0   ┆ 0.018316 ┆ 0.018316 │
│ 1   ┆ 0.091578 ┆ 0.091578 │
│ 2   ┆ 0.238103 ┆ 0.238103 │
│ 3   ┆ 0.43347  ┆ 0.43347  │
│ ... ┆ ...      ┆ ...      │
│ 6   ┆ 0.889326 ┆ 0.889326 │
│ 7   ┆ 0.948866 ┆ 0.948866 │
│ 8   ┆ 0.978637 ┆ 0.978637 │
│ 9   ┆ 0.991868 ┆ 0.991868 │
└─────┴──────────┴──────────┘

CodePudding user response：

It sounds like you want to use .map() instead of .apply() - which will pass whole columns at once.

df.select([
   pl.all(),
   # ...
   pl.struct(["count", "expected_count"])
     .map(lambda x: 
        poisson.cdf(x.struct.field("count"), x.struct.field("expected_count")))
     .flatten()
     .alias("poisson_cdf")
])

shape: (5, 3)
┌───────┬────────────────┬─────────────┐
│ count | expected_count | poisson_cdf │
│ ---   | ---            | ---         │
│ i64   | f64            | f64         │
╞═══════╪════════════════╪═════════════╡
│ 9     | 7.7            | 0.75308     │
│ 2     | 0.2            | 0.998852    │
│ 3     | 0.7            | 0.994247    │
│ 4     | 1.1            | 0.994565    │
│ 5     | 7.5            | 0.241436    │
└───────┴────────────────┴─────────────┘