Drop rows if there are NaN values in a columns for specific years-CodePudding

I'm trying to drop rows where there are NaN values in columns 1, 2, and 3 for the years 2019 and 2020.

Here is the dataframe:

col1     col2     col3     col4     year
NaN      NaN      NaN      NaN      2018
NaN      NaN      NaN      NaN      2019
NaN      NaN      NaN      100      2019
100      200      100      NaN      2020
NaN      NaN      NaN      200      2020
100      100      150      150      2021

Expected result:

col1     col2     col3     col4     year
NaN      NaN      NaN      NaN      2018
100      200      100      NaN      2020
100      100      150      150      2021

I tried this but it couldn't get it to work:

df.drop(df[(df['year'] == [2019, 2020]) 
                            & (df[['col1', 'col2', 'col3']].isnull())].index, inplace=True)

CodePudding user response：

You need to use isin to check for several values, and all to aggregate you boolean DataFrame into a single Series:

# is the year 2019 or 2020?
m1 = df['year'].isin([2019, 2020])
# are all the values in col1/col2/col3 null for a given row?
m2 = df[['col1', 'col2', 'col3']].isnull().all(1)

# invert the above two conditions to select rows to keep
df = df[~(m1&m2)]

# or
# df.drop(df[m1&m2].index, inplace=True)

output:

    col1   col2   col3   col4  year
0    NaN    NaN    NaN    NaN  2018
3  100.0  200.0  100.0    NaN  2020
5  100.0  100.0  150.0  150.0  2021

CodePudding user response：

df.loc[~(df[["col1", "col2", "col3"]].isna().all(axis="columns")
         & df["year"].isin([2019, 2020]))]

Conditions are:

NaN values in columns 1, 2, and 3

df[["col1", "col2", "col3"]].isna().all(axis="columns")

years 2019 and 2020

df["year"].isin([2019, 2020])

Then & these, invert the result and index with it; we will get the desired rows at the end.

CodePudding user response：

As you are filtering based on both year and nan values - you could probably split your dataframe and rejoin

msk = df['year'].isin([2019, 2020]) # drop where True, keep the complement as-is
df_oth = df[~msk]
df_req = df[msk].dropna(how='all', subset=['col1', 'col2', 'col3'])
df_out = pd.concat([df_oth, df_req])

Output

    col1   col2   col3   col4  year
0    NaN    NaN    NaN    NaN  2018
5  100.0  100.0  150.0  150.0  2021
3  100.0  200.0  100.0    NaN  2020