Filter out entire group if all values in group are zero-CodePudding

Using pandas, I want to filter out all groups that contain only zero values

So in pseudo-code something like this

df.groupby('my_group')['values'].filter(all(iszero))

Example input dataframe could be something like this

df = pd.DataFrame({'my_group': ['A', 'B', 'C', 'D']*3, 'values': [0 if (x % 4 == 0 or x == 11) else random.random() for x in range(12)]})

   my_group    values
0         A  0.000000
1         B  0.286104
2         C  0.359804
3         D  0.596152
4         A  0.000000
5         B  0.560742
6         C  0.534575
7         D  0.251302
8         A  0.000000
9         B  0.445010
10        C  0.750434
11        D  0.000000

Here, group A contains all zero values, so it should be filtered out. Group D also has a zero value in row 11, but in other rows it has non-zero values, so it shouldn't be filtered out

CodePudding user response：

Here are possible solution from the best to worse performance:

#filtere groups by != 0 and then filter again original column by mask
df1 = df[df['my_group'].isin(df.loc[df['values'].ne(0), 'my_group'])]

#create mask by groupy.transform
df1 = df[df['values'].ne(0).groupby(df['my_group']).transform('any')]

#filtered by lambda function (if large data it is slow)
df1 = df.groupby('my_group').filter(lambda x: x['values'].ne(0).any())
print (df1)
   my_group    values
1         B  0.286104
2         C  0.359804
3         D  0.596152
5         B  0.560742
6         C  0.534575
7         D  0.251302
9         B  0.445010
10        C  0.750434
11        D  0.000000

CodePudding user response：

IIUC use a condition to keep the rows. For this if any value in the group is not equal (ne) to zero, then keep the group:

df2 = df.groupby('my_group').filter(lambda g: g['values'].ne(0).any())

output:

   my_group    values
1         B  0.286104
2         C  0.359804
3         D  0.596152
5         B  0.560742
6         C  0.534575
7         D  0.251302
9         B  0.445010
10        C  0.750434
11        D  0.000000

Or to get only the indices:

idx = df.groupby('my_group')['values'].filter(lambda s: s.ne(0).any()).index

output: Int64Index([1, 2, 3, 5, 6, 7, 9, 10, 11], dtype='int64')

CodePudding user response：

You can use:

>>> df[df.groupby('my_group')['values'].transform('any')]
   my_group    values
1         B  0.507089
2         C  0.846842
3         D  0.953003
5         B  0.085316
6         C  0.482732
7         D  0.764508
9         B  0.879005
10        C  0.717571
11        D  0.000000