Say I have a dataframe defined as
pd.DataFrame({'col1': ['foo', '', '', 'foo', 'quux', 'baz', 'baz', 'baz'],
              'col2': ['', 'gb', '', 'de', 'gb', '', 'es', 'es'],
              'col3': [123, float("NaN"), 456, 723, 456, 123, 123, 721],
              'col4': ['', '', 'val1', 'val2', 'val3', '', 'val4', 'val5'],
              'value': [1, 1, .4, .5, .3, 1, .5, .4]})
Which looks like
index | col1 | col2 | col3 | col4 | value |
---|---|---|---|---|---|
0 | foo | | 123.0 | | 1.0 |
1 | | gb | NaN | | 1.0 |
2 | | | 456.0 | val1 | 0.4 |
3 | foo | de | 723.0 | val2 | 0.5 |
4 | quux | gb | 456.0 | val3 | 0.3 |
5 | baz | | 123.0 | | 1.0 |
6 | baz | es | 123.0 | val4 | 0.5 |
7 | baz | es | 721.0 | val5 | 0.4 |
I would like to filter this table and remove any rows where the value is equal to 1.0, but also any rows that have the same values in the populated columns as the value==1.0 rows. So in the above table, we would remove rows 0, 1, and 5 since value==1.0; we would also remove row 3 because col1=='foo', row 4 because col2=='gb', and row 6 because col1=='baz' AND col3==123. Rows 2 and 7 should be retained.
index | col1 | col2 | col3 | col4 | value |
---|---|---|---|---|---|
2 | | | 456.0 | val1 | 0.4 |
7 | baz | es | 721 | val5 | 0.4 |
What's the best way to do this? I could find all the rows where value==1.0 and then iterate through them, filtering out of the table any rows that share their values in the populated columns, but iterating through dataframe rows isn't ideal. I also thought of doing a merge, but I'm not sure how to tell a merge to ignore columns where there is no value set.
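For reference, here is roughly what I mean by the iteration idea (just a sketch; drop/result are placeholder names, and I'm assuming "same values in the populated columns" means matching on every column that the value==1.0 row has populated):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['foo', '', '', 'foo', 'quux', 'baz', 'baz', 'baz'],
                   'col2': ['', 'gb', '', 'de', 'gb', '', 'es', 'es'],
                   'col3': [123, float("NaN"), 456, 723, 456, 123, 123, 721],
                   'col4': ['', '', 'val1', 'val2', 'val3', '', 'val4', 'val5'],
                   'value': [1, 1, .4, .5, .3, 1, .5, .4]})

cols = ['col1', 'col2', 'col3', 'col4']
drop = df['value'].eq(1.0)                  # the value == 1.0 rows themselves
for _, row in df[drop].iterrows():          # one pass per value == 1.0 row
    # columns actually populated in this row (non-empty, non-NaN)
    populated = [c for c in cols if pd.notna(row[c]) and row[c] != '']
    # rows sharing this row's values in all of those columns
    match = np.logical_and.reduce([df[c].eq(row[c]) for c in populated])
    drop = drop | match
result = df[~drop]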
CodePudding user response:
Let us do
cond = df.loc[df.value == 1]
out = df[~(df.col1.isin(cond.col1[cond.col1 != ''])
           | df.col2.isin(cond.col2[cond.col2 != '']))]
out
Out[443]:
  col1 col2   col3  col4  value
2            456.0  val1    0.4
CodePudding user response:
I'd suggest treating each column separately.
import numpy as np

# First get the rows where value is 1.
temp = df.query('value == 1')
# Then collect the unique non-empty values from the columns of interest.
vals1 = temp.col1[temp.col1.ne('')].unique()
vals2 = temp.col2[temp.col2.ne('')].unique()
# Finally, filter out any row whose col1 or col2 appears in those values.
df.loc[~(np.isin(df.col1, vals1) | np.isin(df.col2, vals2))]
CodePudding user response:
I usually go with boolean slicing with numpy, as it is straightforward and (for me) the most readable:
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ['foo', '', '', 'foo', 'quux'],
                   'col2': ['', 'gb', '', 'de', 'gb'],
                   'col3': [123, float("NaN"), 456, 723, 456],
                   'col4': ['', '', 'val1', 'val2', 'val3'],
                   'value': [1, 1, .4, .5, .3]})

target = pd.Series({'value': 1.0, 'col1': 'foo', 'col2': 'gb'})

# determine which rows meet the target specifications
lg = np.all(df[target.index] == target, axis=1)

# either drop them using slicing ...
df = df[~lg]
# ... or, equivalently, using drop on the original frame
df.drop(lg[lg].index)
The good thing about this is that you are flexible with regard to how you proceed with the logical vector lg or the interesting indices lg[lg].index, e.g. to handle several target rows (see the sketch below).
=)
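For the original question, where there are several value == 1.0 rows rather than one hand-written target, the same pattern can be repeated per row. A rough sketch along those lines (continuing from the snippet above, and assuming each value == 1.0 row's populated, non-NaN cells define its own target):
# start with the value == 1.0 rows themselves
lg = df['value'] == 1.0
for _, row in df[lg].iterrows():
    # the populated (non-empty, non-NaN) cells of this row, excluding 'value'
    target = row.drop('value')
    target = target[target.notna() & target.ne('')]
    # rows meeting this target's specifications, as above
    lg = lg | np.all(df[target.index] == target, axis=1)
df = df[~lg]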
CodePudding user response:
You can do:
import numpy as np

# collect the non-empty col1/col2 values of the value == 1.0 rows into a set
s = set(filter(lambda x: len(str(x)) > 0,
               np.ravel(df.loc[df['value'].eq(1.0)].fillna('')[['col1', 'col2']].values)))
# drop every row whose col1 or col2 appears in that set
df = df[~(df['col1'].isin(s) | df['col2'].isin(s))]
CodePudding user response:
This should do the job:
# rows where value == 1, with empty strings turned into NaN so they never match
eq1 = df[df['value'].eq(1)].replace('', float("NaN"))
# drop any row that shares a cell value (in the same column) with one of those rows
df[~df.apply(lambda x: (eq1 == x).any(axis=None), axis=1)]