Suppose the next df:
d={'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
'level':['A01', 'A01', 'A01', 'A00','A00', 'A00'],
'job title':['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
'number':[0, 0.001, 0, 0, 0, 0],
'age':[24, 22, 45, 38, 60, 32]}
df=pd.DataFrame(d)
When trying to get the variance for all columns the next result arises:
import pandas as pd
df.agg("var")
Result:
number 1.666667e-07
age 2.025667e 02
dtype: float64
The idea is to remove columns which are in a certain rank, say, drop it if column variance
is in a rank between 0
and 0.0001
, (i.e: drop number
column as it's variance is within this rank).
When tried to do this:
df= df.loc[:, 0 < df.std() < .0001]
The next error arises:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Is it possible to drop pandas df column whose variance is within a tolerance rank?
CodePudding user response:
You can't use chained comparison operator with pandas series, since under the hood, they are translated to and
which only works with scalars. Use vectorized &
instead:
df[(((v := df.var()) > 0) & (v < 0.0001))[lambda x: x].index]
number
0 0.000
1 0.001
2 0.000
3 0.000
4 0.000
5 0.000
CodePudding user response:
Another solution (using .between
.drop(columns=...)
)
var = df.agg("var", numeric_only=True)
df = df.drop(columns=var[var.between(0, 0.0001)].index)
print(df)
Prints:
month country level job title age
0 01/01/2020 Japan A01 Insights Manager 24
1 01/02/2020 Japan A01 Insights Manager 22
2 01/03/2020 Japan A01 Insights Manager 45
3 01/01/2020 Poland A00 Sales Director 38
4 01/02/2020 Poland A00 Sales Director 60
5 01/03/2020 Poland A00 Sales Director 32