Say I have a dataframe defined as
pd.DataFrame({'col1': ['foo', '', '', 'foo', 'quux', 'baz', 'baz', 'baz'],
              'col2': ['', 'gb', '', 'de', 'gb', '', 'es', 'es'],
              'col3': [123, float("NaN"), 456, 723, 456, 123, 123, 721],
              'col4': ['', '', 'val1', 'val2', 'val3', '', 'val4', 'val5'],
              'value': [1, 1, .4, .5, .3, 1, .5, .4]})
Which looks like
index | col1 | col2 | col3 | col4 | value |
---|---|---|---|---|---|
0 | foo | | 123.0 | | 1.0 |
1 | | gb | NaN | | 1.0 |
2 | | | 456.0 | val1 | 0.4 |
3 | foo | de | 723.0 | val2 | 0.5 |
4 | quux | gb | 456.0 | val3 | 0.3 |
5 | baz | | 123.0 | | 1.0 |
6 | baz | es | 123.0 | val4 | 0.5 |
7 | baz | es | 721.0 | val5 | 0.4 |
I would like to filter this table and remove any rows where the value is equal to 1.0, but also any rows that have the same values in the populated columns as the value==1.0 rows. So in the above table, we would remove rows 0, 1, and 5 since value==1.0; we would also remove row 3 because col1=='foo', row 4 because col2=='gb', and row 6 because col1=='baz' AND col3==123. Rows 2 and 7 should be retained.
index | col1 | col2 | col3 | col4 | value |
---|---|---|---|---|---|
2 | | | 456.0 | val1 | 0.4 |
7 | baz | es | 721 | val5 | 0.4 |
What's the best way to do this? I could find all the rows where value==1.0 and then iterate through them, filtering out of the table any rows that share their values in the populated columns, but iterating through dataframe rows isn't ideal. I also thought of doing a merge, but I'm not sure how to tell a merge to ignore columns where there is no value set.
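For reference, here is roughly what I mean by the iteration idea (just a sketch; drop/result are placeholder names, and I'm assuming "same values in the populated columns" means matching on every column that the value==1.0 row has populated):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['foo', '', '', 'foo', 'quux', 'baz', 'baz', 'baz'],
                   'col2': ['', 'gb', '', 'de', 'gb', '', 'es', 'es'],
                   'col3': [123, float("NaN"), 456, 723, 456, 123, 123, 721],
                   'col4': ['', '', 'val1', 'val2', 'val3', '', 'val4', 'val5'],
                   'value': [1, 1, .4, .5, .3, 1, .5, .4]})

cols = ['col1', 'col2', 'col3', 'col4']
drop = df['value'].eq(1.0)                  # the value == 1.0 rows themselves
for _, row in df[drop].iterrows():          # one pass per value == 1.0 row
    # columns actually populated in this row (non-empty, non-NaN)
    populated = [c for c in cols if pd.notna(row[c]) and row[c] != '']
    # rows sharing this row's values in all of those columns
    match = np.logical_and.reduce([df[c].eq(row[c]) for c in populated])
    drop = drop | match
result = df[~drop]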
CodePudding user response:
Let us do
cond = df.loc[df.value == 1]
out = df[~(df.col1.isin(cond.col1[cond.col1 != ''])
           | df.col2.isin(cond.col2[cond.col2 != '']))]
out
Out[443]:
  col1 col2   col3  col4  value
2            456.0  val1    0.4
CodePudding user response:
I'd suggest treating each column separately.
import numpy as np

# First get the rows where value is 1.
temp = df.query('value == 1')
# Then collect the unique non-empty values from the columns of interest.
vals1 = temp.col1[temp.col1.ne('')].unique()
vals2 = temp.col2[temp.col2.ne('')].unique()
# Finally, filter out any row whose col1 or col2 appears in those values.
df.loc[~(np.isin(df.col1, vals1) | np.isin(df.col2, vals2))]
CodePudding user response:
I usually go with boolean slicing with numpy, as it is straightforward and (for me) the most readable:
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': ['foo', '', '', 'foo', 'quux'],
                   'col2': ['', 'gb', '', 'de', 'gb'],
                   'col3': [123, float("NaN"), 456, 723, 456],
                   'col4': ['', '', 'val1', 'val2', 'val3'],
                   'value': [1, 1, .4, .5, .3]})

target = pd.Series({'value': 1.0, 'col1': 'foo', 'col2': 'gb'})

# determine which rows meet the target specifications
lg = np.all(df[target.index] == target, axis=1)

# either drop them using slicing ...
df = df[~lg]
# ... or, equivalently, using drop on the original frame
df.drop(lg[lg].index)
The good thing about this is that you are flexible with regard to how you proceed with the logical vector lg or the interesting indices lg[lg].index, e.g. to handle several target rows (see the sketch below).
=)
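For the original question, where there are several value == 1.0 rows rather than one hand-written target, the same pattern can be repeated per row. A rough sketch along those lines (continuing from the snippet above, and assuming each value == 1.0 row's populated, non-NaN cells define its own target):
# start with the value == 1.0 rows themselves
lg = df['value'] == 1.0
for _, row in df[lg].iterrows():
    # the populated (non-empty, non-NaN) cells of this row, excluding 'value'
    target = row.drop('value')
    target = target[target.notna() & target.ne('')]
    # rows meeting this target's specifications, as above
    lg = lg | np.all(df[target.index] == target, axis=1)
df = df[~lg]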
CodePudding user response:
You can do:
import numpy as np

# collect the non-empty col1/col2 values of the value == 1.0 rows into a set
s = set(filter(lambda x: len(str(x)) > 0,
               np.ravel(df.loc[df['value'].eq(1.0)].fillna('')[['col1', 'col2']].values)))
# drop every row whose col1 or col2 appears in that set
df = df[~(df['col1'].isin(s) | df['col2'].isin(s))]
CodePudding user response:
This should do the job:
# rows where value == 1, with empty strings turned into NaN so they never match
eq1 = df[df['value'].eq(1)].replace('', float("NaN"))
# drop any row that shares a cell value (in the same column) with one of those rows
df[~df.apply(lambda x: (eq1 == x).any(axis=None), axis=1)]