I have a 100x100 DataFrame (representing a Gaussian KDE) that represents a small sample of data so there are many holes. I have another 100x100 df that represents a larger sample of data with no holes. I would like to combine these dfs into a single 100x100 df but with a preference for the smaller df. In other words, I would like to fill in all the missing gaps in the smaller df with data from the larger df.
I can think of two ways of doing this:
- Start with the smaller df and replace all null values with the equivalent value from the larger
- Start with the larger df and overlay all non-null values of the smaller df
Is there an efficient way to do this that doesn't involve going cell-by-cell?
CodePudding user response:
You could call .fill_null()
on each column:
small.select(
pl.col(name).fill_null(large.get_column(name))
for name in small.columns
)
shape: (3, 3)
┌─────┬─────┬─────┐
│ a | b | c │
│ --- | --- | --- │
│ str | str | str │
╞═════╪═════╪═════╡
│ a | DD | g │
├─────┼─────┼─────┤
│ BB | EE | h │
├─────┼─────┼─────┤
│ c | f | i │
└─//──┴─//──┴─//──┘
Sample data:
small = pl.DataFrame(dict(a=["a", None, "c"], b=[None, None, "f"], c=["g", "h", "i"]))
large = pl.DataFrame(dict(a=["AA", "BB", "CC"], b=["DD", "EE", "FF"], c=["GG", "HH", "II"]))
>>> small, large
(shape: (3, 3)
┌──────┬──────┬─────┐
│ a | b | c │
│ --- | --- | --- │
│ str | str | str │
╞══════╪══════╪═════╡
│ a | null | g │
├──────┼──────┼─────┤
│ null | null | h │
├──────┼──────┼─────┤
│ c | f | i │
└─//───┴─//───┴─//──┘,
shape: (3, 3)
┌─────┬─────┬─────┐
│ a | b | c │
│ --- | --- | --- │
│ str | str | str │
╞═════╪═════╪═════╡
│ AA | DD | GG │
├─────┼─────┼─────┤
│ BB | EE | HH │
├─────┼─────┼─────┤
│ CC | FF | II │
└─//──┴─//──┴─//──┘)
CodePudding user response:
You can use the coalesce
function in Polars, which also handles cases where you are coalescing more than two DataFrames.
import polars as pl
from datetime import date
small_sample = pl.DataFrame({
'col_bool': [False, None, None],
'col_float': [1.0, None, None],
'col_dt': [date(2022, 1, 1), None, None],
})
small_sample
medium_sample = pl.DataFrame({
'col_bool': [True, True, None],
'col_float': [10.0, 20.0, None],
'col_dt': [date(2023, 1, 1), date(2023, 2, 1), None],
})
medium_sample
large_sample = pl.DataFrame({
'col_bool': [True, True, True],
'col_float': [100.0, 200.0, 300.0],
'col_dt': [date(2024, 1, 1), date(2024, 2, 1), date(2024, 3, 1)],
})
large_sample
>>> small_sample
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt │
│ --- ┆ --- ┆ --- │
│ bool ┆ f64 ┆ date │
╞══════════╪═══════════╪════════════╡
│ false ┆ 1.0 ┆ 2022-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
└──────────┴───────────┴────────────┘
>>> medium_sample
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt │
│ --- ┆ --- ┆ --- │
│ bool ┆ f64 ┆ date │
╞══════════╪═══════════╪════════════╡
│ true ┆ 10.0 ┆ 2023-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true ┆ 20.0 ┆ 2023-02-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
└──────────┴───────────┴────────────┘
>>> large_sample
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt │
│ --- ┆ --- ┆ --- │
│ bool ┆ f64 ┆ date │
╞══════════╪═══════════╪════════════╡
│ true ┆ 100.0 ┆ 2024-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true ┆ 200.0 ┆ 2024-02-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true ┆ 300.0 ┆ 2024-03-01 │
└──────────┴───────────┴────────────┘
pl.select([
pl.coalesce([small, medium, large])
for small, medium, large in zip(small_sample, medium_sample, large_sample)
])
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt │
│ --- ┆ --- ┆ --- │
│ bool ┆ f64 ┆ date │
╞══════════╪═══════════╪════════════╡
│ false ┆ 1.0 ┆ 2022-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true ┆ 20.0 ┆ 2023-02-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ true ┆ 300.0 ┆ 2024-03-01 │
└──────────┴───────────┴────────────┘