Pandas get rows where columns have specific combination-CodePudding

I have a list of tuples with accepted combinations e.g

valid_comb = [(1,"foo"),(1,"bar"),(2,"foo"),(3,"bar")]

and a DataFrame

import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 3, 1, 2, 3, 3],
                  "name": ["foo", "bar", "qiu", "bar", "bar", "foo", "foo", "foz", "foo"],
                   "type": ["car", "bike", "car", "boat", "bike", "car", "car", "plane", "plane"]})
print(df)

#    id  name   type
# 0  1   foo    car
# 1  1   bar    bike
# 2  1   qiu    car
# 3  2   bar    boat
# 4  3   bar    bike
# 5  1   foo    car
# 6  2   foo    car
# 7  3   foz    plane
# 8  3   foo    plane

and I want the rows where the combination of (id,"name") is in the valid_comb list i.e the resulting dataframe would be

keep_idx = some_smart_function(df)
df_keep = df.loc[keep_idx]

print(df_keep)

#    id  name   type
# 0  1   foo    car
# 1  1   bar    bike
# 4  3   bar    bike
# 5  1   foo    car
# 6  2   foo    car

(note, my original dataframe has around 1.7 mio rows and around 20_000 columns, thus this is just a dummy example)

I have tried the following

valid_id = [p[0] for p in valid_comb]
valid_type = [p[1] for p in valid_comb]

keep_idx = (df["id"]==valid_id) & (df["type"] == valid_type) # Not same length
keep_idx = (df["id"].isin(valid_id)) & (df["type"].isin(valid_type)) # Returns combinations that are not in valid_comb

I have made a solution which works

keep_idx = [(i, t) in valid_comb for (i, t) in zip(
    df["id"], df["name"])]

but that takes quite a while, thus I wonder if theres a vectorized version of this in pandas I have missed

CodePudding user response：

First idea is match MultiIndex with list of tuples:

df = df[df.set_index(['id','name']).index.isin(valid_comb)]
print (df)
   id name  type
0   1  foo   car
1   1  bar  bike
4   3  bar  bike
5   1  foo   car
6   2  foo   car

Or create helper DataFrame with join:

df = df.merge(pd.DataFrame(valid_comb, columns=['id','name']))
print (df)
   id name  type
0   1  foo   car
1   1  foo   car
2   1  bar  bike
3   3  bar  bike
4   2  foo   car

CodePudding user response：

You could use a temporary index:

df.set_index(['id', 'name']).loc[valid_comb].reset_index()

output:

   id name  type
0   1  foo   car
1   1  foo   car
2   1  bar  bike
3   2  foo   car
4   3  bar  bike