Beginner in Pandas here and for a project I am getting dataframes likes this:
import numpy as np
import pandas as pd
data = [[np.nan, np.nan, np.nan, np.nan, "x", np.nan],
[np.nan, np.nan, np.nan, np.nan, "y", np.nan],
[np.nan, np.nan, np.nan, np.nan, "z", np.nan]]
cols = [("column1", "first"), ("column1", "last"), ("column2", "first"), ("column2", "last"), ("column3", "first"), ("column3", "last")]
df = pd.Dataframe(data, columns=cols)
I am getting pairs of columns and want to drop all columns where all elements of both parts of the pair are NaN. For the given code I'd expect the last column pair. Expected result:
column3 / first | column3 / last |
---|---|
x | nan |
y | nan |
z | nan |
I have tried pandas.dropna() with different subsets but I cant get it to work that the last column is not deleted. The name of the columns are variable but always 2 levels deep.
Can some Pandas-God help this poor soul?
CodePudding user response:
First need MultiIndex
in columns:
df = pd.DataFrame(data, columns=pd.MultiIndex.from_tuples(cols))
#if nott create it
#df = pd.Dataframe(data, columns=cols)
#df.columns = pd.MultiIndex.from_tuples(df.columns)
For each first level values test if all values are missing by GroupBy.transform
and then test if at least one True
per columns by DataFrame.any
, chain inverted mask by ~
and |
for bitwise OR
and filter in DataFrame.loc
:
m1 = ~df.isna().groupby(level=0, axis=1).transform('all').any()
m2 = ~df.isna().any()
df = df.loc[:, m1 | m2]
print(df)
column3
first last
0 x NaN
1 y NaN
2 z NaN