I have a data set like this:
seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
A NaN 4 5 NaN 4 7 NaN 6 8
B 7 2 9 2 1 9 2 1 1
C NaN 4 4 2 4 NaN 2 6 8
D 5 NaN NaN 2 5 9 NaN 1 1
I want to remove the rows where at least three of the columns marked 'T01' are NaN
So the output would be:
seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
B 7 2 9 2 1 9 2 1 1
C NaN 4 4 2 4 NaN 2 6 8
D 5 NaN NaN 2 5 9 NaN 1 1
Because the A row there is NaN is S01-T01, S02-T02, S03-T01. Row D also has three NaNs, but it is kept in because I am only interested in removing the rows if specifically there is >=3 NaN in the column names that have a T01 in them.
I know this could be simple to do, I wrote:
import sys
import pandas as pd
df = pd.read_csv('data.csv',sep=',')
print(df.columns.str.contains['T01'])
To first get all of the cells with T01 in them, and then I was going to count them.
I got the error:
print(df.columns.str.contains['T01'])
TypeError: 'method' object is not subscriptable
Then I thought about iterating through the rows and counting instead e.g.:
for index,row in df.iterrows():
if 'T01' in row:
print(row)
This runs without error but prints nothing to screen. Could someone demonstrate a better way to do this?
CodePudding user response:
If you select only the 'T01' columns, you can take the rowwise sum of nulls and keep only rows that are less than 3.
df.loc[df[[x for x in df if 'T01' in x]].isnull().sum(1).lt(3)]