Home > OS >  Removing rows based on multiple lists in Python
Removing rows based on multiple lists in Python

Time:12-09

I am trying to remove rows based on multiple list and a related column. The condition is that when the location string doesn't fall into the list of country, remove the rows. My current approach is to use np.select and I wonder if there's a way to make it more succinct or easier to re-use.

Europe = ["London", "Paris", "Berlin"]
Amercia = ["Boston", "New York", "Florida"]
Asia = ["Hong Kong", "Singapore", "Tokyo"]

data = {
    "country1": ["Europe", "Asia", "Asia", "Europe", "Europe", "Amercia", "Amercia",  "Asia", "Asia", "Europe"], 
    'location':["London", "London", "Hong Kong", "Hong Kong", "New York", "New York", "Singapore", "Singapore",  "London", "London"]}
df = pd.DataFrame(data)

   country   location
0   Europe     London
1     Asia     London
2     Asia  Hong Kong
3   Europe  Hong Kong
4   Europe   New York
5  Amercia   New York
6  Amercia  Singapore
7     Asia  Singapore
8     Asia     London
9   Europe     London

Current approach

conditions = [
    (df['location'].str.contains('|'.join(Europe))),
    (df['location'].str.contains('|'.join(Amercia))),
    (df['location'].str.contains('|'.join(Asia)))]

choices = ["Europe", "Amercia", "Asia"]
df['country2'] = np.select(conditions, choices, default=0)
df2 = df.query("country1 == country2").reset_index(drop = True)

Expected output:

   country   location
0   Europe     London
1     Asia  Hong Kong
2  Amercia   New York
3     Asia  Singapore
4   Europe     London

CodePudding user response:

If possible mapping location by values from list convertd to flatten dictionaries and comapred with original column use:

d = {"Europe":Europe, "Amercia":Amercia, "Asia":Asia}

d1 = {x:k for k, v in d.items() for x in v}
print (d1)
{'London': 'Europe', 'Paris': 'Europe', 'Berlin': 'Europe',
 'Boston': 'Amercia',  'New York': 'Amercia', 'Florida': 'Amercia', 
 'Hong Kong': 'Asia', 'Singapore': 'Asia', 'Tokyo': 'Asia'}


df1 = df[df['location'].map(d1).eq(df['country1'])].reset_index(drop = True)
print (df1)
  country1   location
0   Europe     London
1     Asia  Hong Kong
2  Amercia   New York
3     Asia  Singapore
4   Europe     London

If possible substring in original column is possible test first matched value by dictionary:

df1 = df[df['location'].str.extract(f'({"|".join(d1)})', expand=False).map(d1).eq(df['country1'])]
  • Related