I am doing a location filter on a dataset column. I am using something similar to:
locations_list = ["london", "new york", "berlin", "texas"]
I am then filtering the column using
df[df["location"].str.contains(r'\b(?:{})\b'.format('|'.join(location_list))]
It works but it picks up places like "new zealand", assuming because its searching for new seperately. Is there a way of fixing this?
Example DF
locations = ["new york", "vegas", "london, england", "amsterdam", "usa, new york", "new zealand" ]
df = pd.DataFrame({"Location": locations})
CodePudding user response:
I would use pandas isin
df[df['location'].isin(locations_list)]
CodePudding user response:
One-liner
As a (not-recommended) one-liner:
df[df.Location.str.split(', ').apply(lambda r: bool(set(r).intersection(set(locations_list))))]
More maintainable
Or more maintainable:
import pandas as pd
locations_list = ["london", "new york", "berlin", "texas"]
locations = ["new york", "vegas", "london, england", "amsterdam", "usa, new york", "new zealand" ]
df = pd.DataFrame({"Location": locations})
Encapsulating the logic in a function which takes the list of locations to match against. I would rather specify : set
as the input type, though:
def make_location_searcher(given_locations: list):
def contains_location(df_row: str) -> bool:
row_locs = set(df_row.split(', '))
given_locs = set(given_locations)
return bool(row_locs.intersection(given_locs))
return contains_location
With that, the following two lines do the searching:
is_in_these_4_places = make_location_searcher(locations_list)
df.Location.apply(is_in_these_4_places)
Slight refactoring
Upon further thought, the function is more useful if the return type is not simply bool
, but the actual intersecting set:
def make_location_searcher(given_locations: list):
def matching_locations(df_row: str):
row_locs = set(df_row.split(', '))
given_locs = set(given_locations)
return row_locs.intersection(given_locs)
return matching_locations
With that, you not only can retreive the correct lines, but actually see which location(s) matched per row:
>>> lnbt_searcher = make_location_searcher(locations_list)
>>> df.Location.apply(lnbt_searcher)
0 {new york}
1 {}
2 {london}
3 {}
4 {new york}
5 {}
Name: Location, dtype: object
To get the original filter behavior, simply cast this series to bool:
>>> df[df.Location.apply(lnbt_searcher).astype(bool)]
Location
0 new york
2 london, england
4 usa, new york