Home > Enterprise >  polars dropna equivalent on list of columns
polars dropna equivalent on list of columns

Time:10-06

I'm a new polars user. Pandas has df.dropna. I need to replace this functionality, but I haven't found a dropna in polars. Searching for drona currently yields no results in the Polars User Guide.

My specific problem: convert the following statement from pandas to polars

df.dropna(subset=list_of_vars, thresh=1)

I think I should use df.filter. I do not know a-priori what will be in list_of_vars, so making a set of | filters is a bit tricky. all vars in list_of_vars are columns in the dataframe

input

import polars as pl

df = pl.DataFrame(
    {
        'col1':[0,float('nan'),2,3],
        'col2':[0,float('nan'),float('nan'),3],
        'col3':[0,1,2,3],
        'col3':[float('nan'),float('nan'),float('nan'),float('nan')]
    }
)
df
list_of_vars = ['col1', 'col2']

Desired output:

Keep only rows in the dataframe where the value of at least one of the columns in list_of_vars is not NaN/null.

┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ f64  ┆ f64  ┆ i64  ┆ f64  │
╞══════╪══════╪══════╪══════╡
│ 0.0  ┆ 0.0  ┆ 0    ┆ NaN  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0  ┆ NaN  ┆ 2    ┆ NaN  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.0  ┆ 3.0  ┆ 3    ┆ NaN  │
└──────┴──────┴──────┴──────┘

In this case, rows with a good value from col1 or col2 are retained. Row 3, with a good value only from col3, is dropped.

CodePudding user response:

Polars has is_nan and is_not_nan expressions as well as is_null,is_not_null. We need to count the number of NaNs per row, turn this into a column and then filter by this column with the thresh

import polars as pl

df = pl.DataFrame(
    {
        'col1':[0,float('nan'),2,3],
        "col2":[0,float('nan'),float('nan'),3]
    }
)
list_of_vars = ["col1","col2"]
thresh = 1
df
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 0.0  ┆ 0.0  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ NaN  ┆ NaN  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0  ┆ NaN  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.0  ┆ 3.0  │
└──────┴──────┘
(
    df
    .with_column(
df.select(pl.col(list_of_vars).is_nan().cast(pl.Int64)).sum(axis=1).alias("na_count")
 )
    .filter(pl.col("na_count") <= thresh)
)
shape: (3, 3)
┌──────┬──────┬──────────┐
│ col1 ┆ col2 ┆ na_count │
│ ---  ┆ ---  ┆ ---      │
│ f64  ┆ f64  ┆ i64      │
╞══════╪══════╪══════════╡
│ 0.0  ┆ 0.0  ┆ 0        │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2.0  ┆ NaN  ┆ 1        │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3.0  ┆ 3.0  ┆ 0        │
└──────┴──────┴──────────┘```

CodePudding user response:

Current best attempt:

var_a, var_b = list_of_vars

df.filter(                            
pl.col(var_a).is_not_null() | (pl.col(var_b).is_not_null())
)

Works if I know the number of variables in list_of_vars but the number of variables changes on each run

  • Related