I am trying to figure out a code where all rows of the same 'SCU_KEY' are deleted if the 'STATUS' == 0. So you will see that SCU_KEY -> 5 has a 0 in the status, so I want to delete all of the SCU_KEY's containing a 5. Here is a sample dataframe and the desired output.
Dataframe:
df = pd.DataFrame({'SCU_KEY': [3, 3, 3, 5, 5, 5, 5, 5, 16, 16, 16],
'STATUS' : [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]})
Desired output:
df_2 = pd.DataFrame({'SCU_KEY': [3, 3, 3, 16, 16, 16],
'STATUS' : [1, 1, 1, 1, 1, 1]})
CodePudding user response:
Use a first filter to find SCU_KEY where STATUS equals 0 then check all rows where SCU_KEY is not equal to your first filter.
>>> df[~df['SCU_KEY'].isin(df.loc[df['STATUS'] == 0, 'SCU_KEY'])]
SCU_KEY STATUS
0 3 1
1 3 1
2 3 1
8 16 1
9 16 1
10 16 1
First filter:
>>> df.loc[df['STATUS'] == 0, 'SCU_KEY']
5 5
Name: SCU_KEY, dtype: int64
CodePudding user response:
Use groupby
filter
# filter out all 'SCU_KEY' groups
# that have at least one 'STATUS' == 0
df2 = df.groupby('SCU_KEY').filter(lambda g: ~g['STATUS'].eq(0).any())
EDIT - Performance Test
Although I find this solution somehow more idiomatic, Corralien's solution is away faster if your DataFrame is large.
Setup
n = 500_000
max_groups = 20
df1 = pd.DataFrame({
'SCU_KEY': rng.integers(max_groups, size=n),
'STATUS': rng.integers(2, size=n)
})
Results
Here are the results for comparison
# Corralien's
>>> %timeit df1[~df1['SCU_KEY'].isin(df1.loc[df1['STATUS'] == 0, 'SCU_KEY'])]
15.2 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# My solution
>>> %timeit df1.groupby('SCU_KEY').filter(lambda g: ~g['STATUS'].eq(0).any())
59.4 ms ± 9.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Solution suggested by wwnde (see comments)
>>> %timeit df1[df1.groupby('SCU_KEY')['STATUS'].transform(lambda x: (x!=0).all())]
210 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)