Home > other >  Multiple polars DataFrames with same shape: update one will non-null values of the other
Multiple polars DataFrames with same shape: update one will non-null values of the other

Time:12-18

I have a 100x100 DataFrame (representing a Gaussian KDE) that represents a small sample of data so there are many holes. I have another 100x100 df that represents a larger sample of data with no holes. I would like to combine these dfs into a single 100x100 df but with a preference for the smaller df. In other words, I would like to fill in all the missing gaps in the smaller df with data from the larger df.

I can think of two ways of doing this:

  1. Start with the smaller df and replace all null values with the equivalent value from the larger
  2. Start with the larger df and overlay all non-null values of the smaller df

Is there an efficient way to do this that doesn't involve going cell-by-cell?

CodePudding user response:

You could call .fill_null() on each column:

small.select(
    pl.col(name).fill_null(large.get_column(name)) 
    for name in small.columns
)
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   | b   | c   │
│ --- | --- | --- │
│ str | str | str │
╞═════╪═════╪═════╡
│ a   | DD  | g   │
├─────┼─────┼─────┤
│ BB  | EE  | h   │
├─────┼─────┼─────┤
│ c   | f   | i   │
└─//──┴─//──┴─//──┘

Sample data:

small = pl.DataFrame(dict(a=["a", None, "c"], b=[None, None, "f"], c=["g", "h", "i"]))
large = pl.DataFrame(dict(a=["AA", "BB", "CC"], b=["DD", "EE", "FF"], c=["GG", "HH", "II"]))
>>> small, large
(shape: (3, 3)
 ┌──────┬──────┬─────┐
 │ a    | b    | c   │
 │ ---  | ---  | --- │
 │ str  | str  | str │
 ╞══════╪══════╪═════╡
 │ a    | null | g   │
 ├──────┼──────┼─────┤
 │ null | null | h   │
 ├──────┼──────┼─────┤
 │ c    | f    | i   │
 └─//───┴─//───┴─//──┘,
 shape: (3, 3)
 ┌─────┬─────┬─────┐
 │ a   | b   | c   │
 │ --- | --- | --- │
 │ str | str | str │
 ╞═════╪═════╪═════╡
 │ AA  | DD  | GG  │
 ├─────┼─────┼─────┤
 │ BB  | EE  | HH  │
 ├─────┼─────┼─────┤
 │ CC  | FF  | II  │
 └─//──┴─//──┴─//──┘)

CodePudding user response:

You can use the coalesce function in Polars, which also handles cases where you are coalescing more than two DataFrames.

import polars as pl
from datetime import date

small_sample = pl.DataFrame({
    'col_bool': [False, None, None],
    'col_float': [1.0, None, None],
    'col_dt': [date(2022, 1, 1), None, None],
})
small_sample


medium_sample = pl.DataFrame({
    'col_bool': [True, True, None],
    'col_float': [10.0, 20.0, None],
    'col_dt': [date(2023, 1, 1), date(2023, 2, 1), None],
})
medium_sample


large_sample = pl.DataFrame({
    'col_bool': [True, True, True],
    'col_float': [100.0, 200.0, 300.0],
    'col_dt': [date(2024, 1, 1), date(2024, 2, 1), date(2024, 3, 1)],
})
large_sample
>>> small_sample
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt     │
│ ---      ┆ ---       ┆ ---        │
│ bool     ┆ f64       ┆ date       │
╞══════════╪═══════════╪════════════╡
│ false    ┆ 1.0       ┆ 2022-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null     ┆ null      ┆ null       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null     ┆ null      ┆ null       │
└──────────┴───────────┴────────────┘

>>> medium_sample
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt     │
│ ---      ┆ ---       ┆ ---        │
│ bool     ┆ f64       ┆ date       │
╞══════════╪═══════════╪════════════╡
│ true     ┆ 10.0      ┆ 2023-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true     ┆ 20.0      ┆ 2023-02-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null     ┆ null      ┆ null       │
└──────────┴───────────┴────────────┘

>>> large_sample
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt     │
│ ---      ┆ ---       ┆ ---        │
│ bool     ┆ f64       ┆ date       │
╞══════════╪═══════════╪════════════╡
│ true     ┆ 100.0     ┆ 2024-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true     ┆ 200.0     ┆ 2024-02-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true     ┆ 300.0     ┆ 2024-03-01 │
└──────────┴───────────┴────────────┘
pl.select([
    pl.coalesce([small, medium, large])
    for small, medium, large in zip(small_sample, medium_sample, large_sample)
])
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt     │
│ ---      ┆ ---       ┆ ---        │
│ bool     ┆ f64       ┆ date       │
╞══════════╪═══════════╪════════════╡
│ false    ┆ 1.0       ┆ 2022-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true     ┆ 20.0      ┆ 2023-02-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true     ┆ 300.0     ┆ 2024-03-01 │
└──────────┴───────────┴────────────┘
  • Related