I have a list of tuples with accepted combinations e.g
valid_comb = [(1,"foo"),(1,"bar"),(2,"foo"),(3,"bar")]
and a DataFrame
import pandas as pd
df = pd.DataFrame({"id": [1, 1, 1, 2, 3, 1, 2, 3, 3],
"name": ["foo", "bar", "qiu", "bar", "bar", "foo", "foo", "foz", "foo"],
"type": ["car", "bike", "car", "boat", "bike", "car", "car", "plane", "plane"]})
print(df)
# id name type
# 0 1 foo car
# 1 1 bar bike
# 2 1 qiu car
# 3 2 bar boat
# 4 3 bar bike
# 5 1 foo car
# 6 2 foo car
# 7 3 foz plane
# 8 3 foo plane
and I want the rows where the combination of (id,"name")
is in the valid_comb
list i.e the resulting dataframe would be
keep_idx = some_smart_function(df)
df_keep = df.loc[keep_idx]
print(df_keep)
# id name type
# 0 1 foo car
# 1 1 bar bike
# 4 3 bar bike
# 5 1 foo car
# 6 2 foo car
(note, my original dataframe has around 1.7 mio rows and around 20_000 columns, thus this is just a dummy example)
I have tried the following
valid_id = [p[0] for p in valid_comb]
valid_type = [p[1] for p in valid_comb]
keep_idx = (df["id"]==valid_id) & (df["type"] == valid_type) # Not same length
keep_idx = (df["id"].isin(valid_id)) & (df["type"].isin(valid_type)) # Returns combinations that are not in valid_comb
I have made a solution which works
keep_idx = [(i, t) in valid_comb for (i, t) in zip(
df["id"], df["name"])]
but that takes quite a while, thus I wonder if theres a vectorized version of this in pandas I have missed
CodePudding user response:
First idea is match MultiIndex
with list of tuples:
df = df[df.set_index(['id','name']).index.isin(valid_comb)]
print (df)
id name type
0 1 foo car
1 1 bar bike
4 3 bar bike
5 1 foo car
6 2 foo car
Or create helper DataFrame with join:
df = df.merge(pd.DataFrame(valid_comb, columns=['id','name']))
print (df)
id name type
0 1 foo car
1 1 foo car
2 1 bar bike
3 3 bar bike
4 2 foo car
CodePudding user response:
You could use a temporary index:
df.set_index(['id', 'name']).loc[valid_comb].reset_index()
output:
id name type
0 1 foo car
1 1 foo car
2 1 bar bike
3 2 foo car
4 3 bar bike