Search with a list of terms in a dataframe column where rows are lists of eligible terms

Time:09-19

I would like to filter a dataframe by using a list of search terms. I need to check if any of the terms in the list filter_by are present in one of the rows of my dataframe. Each row in the dataframe contains a list of eligible terms.

I tried several hacks and more verbose solutions that kind of work. I wonder if there is a simple and possibly vectorized solution to this.

Example list of search terms:

filter_by = ["CSV", "JPG", "TXT"]

Example dataframe:

    data
0   [JPG, XML, PDF, TXT, XLS, XLSX]
1   [XLS, JPG, PDF, XLSX]
2   [JPG, PDF, TXT]
3   [XLS, PDF, XML, JPG]
4   [CSV]
5   [PDF, TXT, XLS]
6   [XML]
7   [JPG]
8   [PDF, CSV, XML, XLS, TXT, JPG]
9   [TXT, CSV, XLS, XLSX, PDF, JPG]

The resulting dataframe should only contain the rows where one or more of the elements of filter_by is present (OR condition).
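For reference, the example above can be reproduced with the following sketch (the column name data is taken from the printout):

```python
import pandas as pd

# Reconstruct the example dataframe: each row holds a list of eligible terms
df = pd.DataFrame({
    "data": [
        ["JPG", "XML", "PDF", "TXT", "XLS", "XLSX"],
        ["XLS", "JPG", "PDF", "XLSX"],
        ["JPG", "PDF", "TXT"],
        ["XLS", "PDF", "XML", "JPG"],
        ["CSV"],
        ["PDF", "TXT", "XLS"],
        ["XML"],
        ["JPG"],
        ["PDF", "CSV", "XML", "XLS", "TXT", "JPG"],
        ["TXT", "CSV", "XLS", "XLSX", "PDF", "JPG"],
    ]
})
```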

I tried:

  • Flattening each dataframe row to a string and using str.contains("|".join(filter_by)). This works. But it seems strange to me that I have to change the data structure fundamentally to achieve such a seemingly simple task.
  • .apply() and a custom function. This works but seems verbose.
  • .query() doesn't seem to be a solution. Neither is something like df[df.data.isin(filter_by)].
  • I also searched for similar questions on SO, to no avail.
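For context, the flatten-to-string workaround from the first bullet looks roughly like this (a sketch with a small subset of the data; note that substring matching can produce false positives, e.g. a term "TXT" would also match "TXTX"):

```python
import pandas as pd

df = pd.DataFrame({"data": [["JPG", "XML"], ["CSV"], ["XLS"]]})
filter_by = ["CSV", "JPG", "TXT"]

# Join each list into one string, then regex-match any of the search terms
mask = df["data"].str.join(" ").str.contains("|".join(filter_by))
print(df[mask])
```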

Thanks in advance. Any help is much appreciated.

CodePudding user response:

You may use set(filter_by).isdisjoint inside apply and negate the result:

filter_by = ["CSV", "JPG", "TXT"]
df[~df.data.apply(set(filter_by).isdisjoint)]
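A minimal runnable demonstration of this approach, using a small subset of the example data:

```python
import pandas as pd

df = pd.DataFrame({"data": [["XLS", "JPG"], ["XML"], ["CSV"], ["PDF"]]})
filter_by = ["CSV", "JPG", "TXT"]

# A row is kept when its list of terms is NOT disjoint from filter_by,
# i.e. the two share at least one element (OR semantics). Building the
# set once and passing its bound isdisjoint method avoids re-creating
# the set on every row.
result = df[~df["data"].apply(set(filter_by).isdisjoint)]
print(result)
```

Rows 0 and 2 survive (they contain JPG and CSV respectively), while the rows containing only XML or PDF are dropped.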