Multiple polars DataFrames with same shape: update one will non-null values of the other-CodePudding

I have a 100x100 DataFrame (representing a Gaussian KDE) that represents a small sample of data so there are many holes. I have another 100x100 df that represents a larger sample of data with no holes. I would like to combine these dfs into a single 100x100 df but with a preference for the smaller df. In other words, I would like to fill in all the missing gaps in the smaller df with data from the larger df.

I can think of two ways of doing this:

Start with the smaller df and replace all null values with the equivalent value from the larger
Start with the larger df and overlay all non-null values of the smaller df

Is there an efficient way to do this that doesn't involve going cell-by-cell?

CodePudding user response：

You could call .fill_null() on each column:

small.select(
    pl.col(name).fill_null(large.get_column(name)) 
    for name in small.columns
)
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   | b   | c   │
│ --- | --- | --- │
│ str | str | str │
╞═════╪═════╪═════╡
│ a   | DD  | g   │
├─────┼─────┼─────┤
│ BB  | EE  | h   │
├─────┼─────┼─────┤
│ c   | f   | i   │
└─//──┴─//──┴─//──┘

Sample data:

small = pl.DataFrame(dict(a=["a", None, "c"], b=[None, None, "f"], c=["g", "h", "i"]))
large = pl.DataFrame(dict(a=["AA", "BB", "CC"], b=["DD", "EE", "FF"], c=["GG", "HH", "II"]))
>>> small, large
(shape: (3, 3)
 ┌──────┬──────┬─────┐
 │ a    | b    | c   │
 │ ---  | ---  | --- │
 │ str  | str  | str │
 ╞══════╪══════╪═════╡
 │ a    | null | g   │
 ├──────┼──────┼─────┤
 │ null | null | h   │
 ├──────┼──────┼─────┤
 │ c    | f    | i   │
 └─//───┴─//───┴─//──┘,
 shape: (3, 3)
 ┌─────┬─────┬─────┐
 │ a   | b   | c   │
 │ --- | --- | --- │
 │ str | str | str │
 ╞═════╪═════╪═════╡
 │ AA  | DD  | GG  │
 ├─────┼─────┼─────┤
 │ BB  | EE  | HH  │
 ├─────┼─────┼─────┤
 │ CC  | FF  | II  │
 └─//──┴─//──┴─//──┘)

CodePudding user response：

You can use the coalesce function in Polars, which also handles cases where you are coalescing more than two DataFrames.

import polars as pl
from datetime import date

small_sample = pl.DataFrame({
    'col_bool': [False, None, None],
    'col_float': [1.0, None, None],
    'col_dt': [date(2022, 1, 1), None, None],
})
small_sample


medium_sample = pl.DataFrame({
    'col_bool': [True, True, None],
    'col_float': [10.0, 20.0, None],
    'col_dt': [date(2023, 1, 1), date(2023, 2, 1), None],
})
medium_sample


large_sample = pl.DataFrame({
    'col_bool': [True, True, True],
    'col_float': [100.0, 200.0, 300.0],
    'col_dt': [date(2024, 1, 1), date(2024, 2, 1), date(2024, 3, 1)],
})
large_sample

>>> small_sample
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt     │
│ ---      ┆ ---       ┆ ---        │
│ bool     ┆ f64       ┆ date       │
╞══════════╪═══════════╪════════════╡
│ false    ┆ 1.0       ┆ 2022-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null     ┆ null      ┆ null       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null     ┆ null      ┆ null       │
└──────────┴───────────┴────────────┘

>>> medium_sample
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt     │
│ ---      ┆ ---       ┆ ---        │
│ bool     ┆ f64       ┆ date       │
╞══════════╪═══════════╪════════════╡
│ true     ┆ 10.0      ┆ 2023-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true     ┆ 20.0      ┆ 2023-02-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null     ┆ null      ┆ null       │
└──────────┴───────────┴────────────┘

>>> large_sample
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt     │
│ ---      ┆ ---       ┆ ---        │
│ bool     ┆ f64       ┆ date       │
╞══════════╪═══════════╪════════════╡
│ true     ┆ 100.0     ┆ 2024-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true     ┆ 200.0     ┆ 2024-02-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true     ┆ 300.0     ┆ 2024-03-01 │
└──────────┴───────────┴────────────┘

pl.select([
    pl.coalesce([small, medium, large])
    for small, medium, large in zip(small_sample, medium_sample, large_sample)
])

shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt     │
│ ---      ┆ ---       ┆ ---        │
│ bool     ┆ f64       ┆ date       │
╞══════════╪═══════════╪════════════╡
│ false    ┆ 1.0       ┆ 2022-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true     ┆ 20.0      ┆ 2023-02-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true     ┆ 300.0     ┆ 2024-03-01 │
└──────────┴───────────┴────────────┘