Pandas filter rows based on certain number of certain columns being NaN-CodePudding

I have a data set like this:

seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
A   NaN       4       5       NaN     4       7       NaN       6       8
B   7         2       9       2       1       9       2         1       1 
C   NaN       4       4       2       4       NaN     2         6       8
D   5         NaN     NaN     2       5       9       NaN       1       1

I want to remove the rows where at least three of the columns marked 'T01' are NaN

So the output would be:

seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
B   7         2       9       2       1       9       2         1       1 
C   NaN       4       4       2       4       NaN     2         6       8
D   5         NaN     NaN     2       5       9       NaN       1       1

Because the A row there is NaN is S01-T01, S02-T02, S03-T01. Row D also has three NaNs, but it is kept in because I am only interested in removing the rows if specifically there is >=3 NaN in the column names that have a T01 in them.

I know this could be simple to do, I wrote:

import sys
import pandas as pd

df = pd.read_csv('data.csv',sep=',')
print(df.columns.str.contains['T01'])

To first get all of the cells with T01 in them, and then I was going to count them.

I got the error:

    print(df.columns.str.contains['T01'])
TypeError: 'method' object is not subscriptable

Then I thought about iterating through the rows and counting instead e.g.:

for index,row in df.iterrows():
        if 'T01' in row:
                print(row)

This runs without error but prints nothing to screen. Could someone demonstrate a better way to do this?

CodePudding user response：

If you select only the 'T01' columns, you can take the rowwise sum of nulls and keep only rows that are less than 3.

df.loc[df[[x for x in df if 'T01' in x]].isnull().sum(1).lt(3)]