Home > Back-end >  Filtering a dataset from a list of words
Filtering a dataset from a list of words

Time:08-17

I am doing a location filter on a dataset column. I am using something similar to:

locations_list = ["london", "new york", "berlin", "texas"]

I am then filtering the column using

df[df["location"].str.contains(r'\b(?:{})\b'.format('|'.join(location_list))]

It works but it picks up places like "new zealand", assuming because its searching for new seperately. Is there a way of fixing this?

Example DF

locations = ["new york", "vegas", "london, england", "amsterdam", "usa, new york", "new zealand" ]
df = pd.DataFrame({"Location": locations})

CodePudding user response:

I would use pandas isin

df[df['location'].isin(locations_list)]

CodePudding user response:

One-liner

As a (not-recommended) one-liner:

df[df.Location.str.split(', ').apply(lambda r: bool(set(r).intersection(set(locations_list))))]

More maintainable

Or more maintainable:

import pandas as pd

locations_list = ["london", "new york", "berlin", "texas"]
locations = ["new york", "vegas", "london, england", "amsterdam", "usa, new york", "new zealand" ]
df = pd.DataFrame({"Location": locations})

Encapsulating the logic in a function which takes the list of locations to match against. I would rather specify : set as the input type, though:

def make_location_searcher(given_locations: list):
    def contains_location(df_row: str) -> bool:
        row_locs = set(df_row.split(', '))
        given_locs = set(given_locations)
        return bool(row_locs.intersection(given_locs))
    return contains_location

With that, the following two lines do the searching:

is_in_these_4_places = make_location_searcher(locations_list)

df.Location.apply(is_in_these_4_places)

Slight refactoring

Upon further thought, the function is more useful if the return type is not simply bool, but the actual intersecting set:

def make_location_searcher(given_locations: list):
    def matching_locations(df_row: str):
        row_locs = set(df_row.split(', '))
        given_locs = set(given_locations)
        return row_locs.intersection(given_locs)
    return matching_locations

With that, you not only can retreive the correct lines, but actually see which location(s) matched per row:

>>> lnbt_searcher = make_location_searcher(locations_list)
>>> df.Location.apply(lnbt_searcher)
0    {new york}
1            {}
2      {london}
3            {}
4    {new york}
5            {}
Name: Location, dtype: object

To get the original filter behavior, simply cast this series to bool:

>>> df[df.Location.apply(lnbt_searcher).astype(bool)]
          Location
0         new york
2  london, england
4    usa, new york
  • Related