I would like to filter a dataframe by using a list of search terms. I need to check if any of the terms in the list filter_by
are present in one of the rows of my dataframe. Each row in the dataframe contains a list of eligible terms.
I tried several hacks and more verbose solutions that kind of work. I wonder if there is a simple and possibly vectorized solution to this.
Example list of search terms:
filter_by = ["CSV", "JPG", "TXT"]
Example dataframe:
data
0 [JPG, XML, PDF, TXT, XLS, XLSX]
1 [XLS, JPG, PDF, XLSX]
2 [JPG, PDF, TXT]
3 [XLS, PDF, XML, JPG]
4 [CSV]
5 [PDF, TXT, XLS]
6 [XML]
7 [JPG]
8 [PDF, CSV, XML, XLS, TXT, JPG]
9 [TXT, CSV, XLS, XLSX, PDF, JPG]
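For reproducibility, here is a minimal sketch of how this example dataframe could be built (the column name data and the variable name df follow the question and the answer below):

import pandas as pd

df = pd.DataFrame({
    "data": [
        ["JPG", "XML", "PDF", "TXT", "XLS", "XLSX"],
        ["XLS", "JPG", "PDF", "XLSX"],
        ["JPG", "PDF", "TXT"],
        ["XLS", "PDF", "XML", "JPG"],
        ["CSV"],
        ["PDF", "TXT", "XLS"],
        ["XML"],
        ["JPG"],
        ["PDF", "CSV", "XML", "XLS", "TXT", "JPG"],
        ["TXT", "CSV", "XLS", "XLSX", "PDF", "JPG"],
    ]
})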
The resulting dataframe should only contain the rows where one or more of the elements of filter_by are present (OR condition).
I tried:
- Flattening each dataframe row to a string and using str.contains("|".join(filter_by)). This works, but it seems strange to me that I have to change the data structure fundamentally to achieve such a seemingly simple task.
- .apply() and a custom function. This works but seems verbose. (Both approaches are sketched after this list.)
- .query() doesn't seem to be a solution. Neither is something like df[df.data.isin(filter_by)].
- I also searched for similar questions on SO, to no avail.
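Minimal sketches of the first two attempts, assuming the dataframe df with a data column of lists as above:

# 1) Flatten each list into one string, then match a regex OR pattern
mask = df["data"].str.join(" ").str.contains("|".join(filter_by))
out = df[mask]

# 2) .apply() with a custom function that checks for any overlap
out = df[df["data"].apply(lambda terms: any(t in filter_by for t in terms))]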
Thanks in advance. Any help is much appreciated.
CodePudding user response:
You may use the negation of set(filter_by).isdisjoint inside apply:
filter_by = ["CSV", "JPG", "TXT"]
df[~df.data.apply(set(filter_by).isdisjoint)]
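For clarity: apply passes each row's list to the bound method set(filter_by).isdisjoint, which returns True when the row shares no element with filter_by, so the ~ keeps exactly the rows with at least one match. A usage sketch with the example data above:

filtered = df[~df.data.apply(set(filter_by).isdisjoint)]
print(filtered)
# Only row 6 ([XML]) is dropped; every other row contains
# at least one of CSV, JPG or TXT.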