I'm a new polars user. Pandas has df.dropna. I need to replace this functionality, but I haven't found a dropna in polars. Searching for drona
currently yields no results in the Polars User Guide.
My specific problem: convert the following statement from pandas to polars
df.dropna(subset=list_of_vars, thresh=1)
I think I should use df.filter
. I do not know a-priori what will be in list_of_vars
, so making a set of |
filters is a bit tricky. all vars in list_of_vars
are columns in the dataframe
input
import polars as pl
df = pl.DataFrame(
{
'col1':[0,float('nan'),2,3],
'col2':[0,float('nan'),float('nan'),3],
'col3':[0,1,2,3],
'col3':[float('nan'),float('nan'),float('nan'),float('nan')]
}
)
df
list_of_vars = ['col1', 'col2']
Desired output:
Keep only rows in the dataframe where the value of at least one of the columns in list_of_vars
is not NaN/null.
┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 ┆ f64 │
╞══════╪══════╪══════╪══════╡
│ 0.0 ┆ 0.0 ┆ 0 ┆ NaN │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ NaN ┆ 2 ┆ NaN │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.0 ┆ 3.0 ┆ 3 ┆ NaN │
└──────┴──────┴──────┴──────┘
In this case, rows with a good value from col1 or col2 are retained. Row 3, with a good value only from col3, is dropped.
CodePudding user response:
Polars has is_nan
and is_not_nan
expressions as well as is_null
,is_not_null
.
We need to count the number of NaNs per row, turn this into a column and then filter by this column with the thresh
import polars as pl
df = pl.DataFrame(
{
'col1':[0,float('nan'),2,3],
"col2":[0,float('nan'),float('nan'),3]
}
)
list_of_vars = ["col1","col2"]
thresh = 1
df
shape: (4, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞══════╪══════╡
│ 0.0 ┆ 0.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ NaN ┆ NaN │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ NaN │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.0 ┆ 3.0 │
└──────┴──────┘
(
df
.with_column(
df.select(pl.col(list_of_vars).is_nan().cast(pl.Int64)).sum(axis=1).alias("na_count")
)
.filter(pl.col("na_count") <= thresh)
)
shape: (3, 3)
┌──────┬──────┬──────────┐
│ col1 ┆ col2 ┆ na_count │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i64 │
╞══════╪══════╪══════════╡
│ 0.0 ┆ 0.0 ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2.0 ┆ NaN ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3.0 ┆ 3.0 ┆ 0 │
└──────┴──────┴──────────┘```
CodePudding user response:
Current best attempt:
var_a, var_b = list_of_vars
df.filter(
pl.col(var_a).is_not_null() | (pl.col(var_b).is_not_null())
)
Works if I know the number of variables in list_of_vars
but the number of variables changes on each run