How to filter rows from dataframe depending on contents of other rows?


Say I have a dataframe defined as

pd.DataFrame({'col1': ['foo', '', '', 'foo', 'quux', 'baz', 'baz', 'baz'],
              'col2': ['', 'gb', '', 'de', 'gb', '', 'es', 'es'],
              'col3': [123, float("NaN"), 456, 723, 456, 123, 123, 721],
              'col4': ['', '', 'val1', 'val2', 'val3', '', 'val4', 'val5'],
              'value': [1, 1, .4, .5, .3, 1, .5, .4]})

Which looks like

index  col1  col2   col3  col4  value
    0   foo        123.0          1.0
    1           gb    NaN          1.0
    2              456.0  val1    0.4
    3   foo    de  723.0  val2    0.5
    4  quux    gb  456.0  val3    0.3
    5   baz        123.0          1.0
    6   baz    es  123.0  val4    0.5
    7   baz    es  721.0  val5    0.4

I would like to filter this table to remove any rows where value equals 1.0, and also any rows that have the same values, in the populated columns, as those value==1.0 rows. So in the table above we would remove rows 0, 1, and 5 because value==1.0; we would also remove row 3 because col1=='foo', row 4 because col2=='gb', and row 6 because col1=='baz' AND col3==123. Rows 2 and 7 should be retained.

index  col1  col2   col3  col4  value
    2              456.0  val1    0.4
    7   baz    es  721.0  val5    0.4

What's the best way to do this? I could find all the rows where value==1.0 and then iterate through them, filtering out all the rows from the table that have the same values in the populated columns, but iterating through DataFrame rows isn't ideal. I also thought of doing a merge, but I'm not sure how to tell a merge to ignore columns where no value is set.
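For concreteness, this is roughly the row-by-row version I'm hoping to avoid (just a sketch: for each value==1.0 row it takes whatever columns are populated in that row and drops any row that matches on all of them):

import pandas as pd

ones = df[df['value'] == 1.0]
check_cols = ['col1', 'col2', 'col3', 'col4']

def matches_any_one_row(row):
    # compare against each value==1.0 row, using only its populated columns
    for _, one in ones.iterrows():
        populated = [c for c in check_cols if one[c] != '' and pd.notna(one[c])]
        if populated and all(row[c] == one[c] for c in populated):
            return True
    return False

filtered = df[~df.apply(matches_any_one_row, axis=1)]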

CodePudding user response:

Let us do

# rows where value == 1
cond = df.loc[df.value == 1]

# drop rows whose non-blank col1 or col2 also appears in those rows
filter = df[~(df.col1.isin(cond.col1[cond.col1 != ''])
              | df.col2.isin(cond.col2[cond.col2 != '']))]
filter
Out[443]: 
  col1 col2   col3  col4  value
2            456.0  val1    0.4

CodePudding user response:

I'd suggest a per-column treatment.

import numpy as np

# First, get the rows where value is 1
temp = df.query('value == 1')

# Then, collect all unique non-blank values from the columns of interest
vals1, vals2 = temp.col1[temp.col1.ne('')].unique(), temp.col2[temp.col2.ne('')].unique()

# Finally, filter
df.loc[~(np.isin(df.col1, vals1) | np.isin(df.col2, vals2))]
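If you ever need the same treatment over more than two columns, the per-column masks can be built in a loop (a sketch of the same idea, assuming df is the question's frame):

import numpy as np

temp = df.query('value == 1')
cols = ['col1', 'col2']                       # columns to compare on

mask = np.zeros(len(df), dtype=bool)
for c in cols:
    vals = temp[c][temp[c].ne('')].unique()   # non-blank values from the value==1 rows
    mask |= np.isin(df[c], vals)              # rows sharing any of those values

df.loc[~mask]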

CodePudding user response:

I usually go with boolean slicing with numpy, as this is straightforward and (for me) most readable:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ['foo', '', '', 'foo', 'quux'],
                   'col2': ['', 'gb', '', 'de', 'gb'],
                   'col3': [123, float("NaN"), 456, 723, 456],
                   'col4': ['', '', 'val1', 'val2', 'val3'],
                   'value': [1, 1, .4, .5, .3]})

target = pd.Series({'value': 1.0, 'col1': 'foo', 'col2': 'gb'})

# determine which rows meet the target specifications
lg = np.all(df[target.index] == target, axis=1)

# using slicing
df = df[~lg]
# or, equivalently, using drop (on the original frame):
# df = df.drop(lg[lg].index)

The good thing about this is that you are flexible with regard to how you proceed with the logical vector lg or the matching indices lg[lg].index =)
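For example, the matching indices can be pulled out on their own and reused later (just an illustration):

# index labels of the rows that match the target, i.e. the rows being removed
idx = lg[lg].index
print(idx.tolist())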

CodePudding user response:

You can do:

import numpy as np

# non-blank col1/col2 values taken from the value == 1.0 rows
s = set(filter(lambda x: len(str(x)) > 0,
               np.ravel(df.loc[df['value'].eq(1.0)].fillna('')[['col1', 'col2']].values)))

# drop rows whose col1 or col2 appears in that set
df = df[~(df['col1'].isin(s) | df['col2'].isin(s))]
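For the sample frame in the question, s should work out to {'foo', 'gb', 'baz'}, i.e. the non-blank col1/col2 entries of the value == 1.0 rows.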

CodePudding user response:

This should do the job:

# value == 1 rows, with blank strings turned into NaN so they never match
eq1 = df[df['value'].eq(1)].replace('', float("NaN"))

# drop any row that shares a (non-NaN) cell value, column for column, with one of those rows
df[~df.apply(lambda x: (eq1 == x).any(axis=None), axis=1)]