I'd like to filter my dataframe such that only rows that have a column containing a substring of another string are selected. I know that the opposite can be done like this:
selection = df[df.str.contains(substring)]
But how would I do it such that the substring is in the dataframe and it is compared to another string. What I've tried are
import pandas
a = pandas.DataFrame({"b":["foo","bar"]})
selection = a[a.b.str in "foot"] # should match first row
selection = a[a.b.str.isin("foot")] # should match first row
selection = a[a.b.str.isin("foobar")] # should match both rows
but these won't work
CodePudding user response:
You can do this.
import pandas as pd
a = pd.DataFrame({"b":["foo","bar"]})
selection = a[a.b.apply(lambda x: x in "foot")] # should match first row
selection = a[a.b.apply(lambda x: x in "foot")] # should match first row
selection = a[a.b.apply(lambda x: x in "foobar")] # should match both rows
CodePudding user response:
What you could try is make function like so:
f = lambda substring: substring in string
Where string
is your target string. Then, apply this function elementwise to the df.
df.applymap(f)
This should return a DataFrame with True where the element is a substring of your target string, and False where it isn't. To find out which rows contain a True, i.e. contain elements that are substrings of your target string, us df.any()
:
df.applymap(lambda substring: substring in string).any(axis = 1)
If you want to just select these rows use slicing:
df[df.applymap(lambda substring: substring in string).any(axis = 1)]