Deleting multiple rows depending on one row value-CodePudding

I am trying to figure out a code where all rows of the same 'SCU_KEY' are deleted if the 'STATUS' == 0. So you will see that SCU_KEY -> 5 has a 0 in the status, so I want to delete all of the SCU_KEY's containing a 5. Here is a sample dataframe and the desired output.

Dataframe:

df = pd.DataFrame({'SCU_KEY': [3, 3, 3, 5, 5, 5, 5, 5, 16, 16, 16],
                   'STATUS' : [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]})

Desired output:

df_2 = pd.DataFrame({'SCU_KEY': [3, 3, 3, 16, 16, 16],
                     'STATUS' : [1, 1, 1, 1, 1, 1]})

CodePudding user response：

Use a first filter to find SCU_KEY where STATUS equals 0 then check all rows where SCU_KEY is not equal to your first filter.

>>> df[~df['SCU_KEY'].isin(df.loc[df['STATUS'] == 0, 'SCU_KEY'])]
    SCU_KEY  STATUS
0         3       1
1         3       1
2         3       1
8        16       1
9        16       1
10       16       1

First filter:

>>> df.loc[df['STATUS'] == 0, 'SCU_KEY']
5    5
Name: SCU_KEY, dtype: int64

CodePudding user response：

Use groupby filter

# filter out all 'SCU_KEY' groups 
# that have at least one 'STATUS' == 0
df2 = df.groupby('SCU_KEY').filter(lambda g: ~g['STATUS'].eq(0).any())

EDIT - Performance Test

Although I find this solution somehow more idiomatic, Corralien's solution is away faster if your DataFrame is large.

Setup

n = 500_000
max_groups = 20
df1 = pd.DataFrame({
    'SCU_KEY': rng.integers(max_groups, size=n),
    'STATUS': rng.integers(2, size=n)
})

Results

Here are the results for comparison

# Corralien's 
>>> %timeit df1[~df1['SCU_KEY'].isin(df1.loc[df1['STATUS'] == 0, 'SCU_KEY'])]

15.2 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

# My solution
>>> %timeit df1.groupby('SCU_KEY').filter(lambda g: ~g['STATUS'].eq(0).any())

59.4 ms ± 9.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Solution suggested by wwnde (see comments)
>>> %timeit df1[df1.groupby('SCU_KEY')['STATUS'].transform(lambda x: (x!=0).all())]

210 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)