I have a pandas dataframe as shown below:
I want to drop the rows that has only one non zero value. What's the most efficient way to do this?
CodePudding user response:
Try boolean indexing
# sample data
df = pd.DataFrame(np.zeros((10, 10)), columns=list('abcdefghij'))
df.iloc[2:5, 3] = 1
df.iloc[4:5, 4] = 1
# boolean indexing based on condition
df[df.ne(0).sum(axis=1).ne(1)]
Only rows 2 and 3 are removed because row 4 has two non-zero values and every other row has zero non-zero values. So we drop rows 2 and 3.
df.ne(0).sum(axis=1)
0 0
1 0
2 1
3 1
4 2
5 0
6 0
7 0
8 0
9 0
CodePudding user response:
Not sure if this is the most efficient but I'll try:
df[[col for col in df.columns if (df[col] != 0).sum() == 1]]
2 loops per column here: 1 for checking if != 0 and one more to sum the boolean values up (could break earlier if the second value is found).
Otherwise, you can define a custom function to check without looping twice per column:
def check(column):
already_has_one = False
for value in column:
if value != 0:
if already_has_one:
return False
already_has_one = True
return already_has_one
then:
df[[col for col in df.columns if check(df[col])]]
Which is much faster than the first.
CodePudding user response:
Or like this:
df[(df.applymap(lambda x: bool(x)).sum(1) > 1).values]