I have a df with columns product_id
and keyword
. I am iterating over product id by for group_name, group in s.groupby('product_id')
and in this step, I want to keep groups that contain all elements from the search list.
for example:
and my search list = ['land', 'cruiser']
so in the output, I want this:
I have tried this
data[data["keywords"].isin(search_words)]
but this method keeps all groups if it contains at least one element from the list, while I groups must contain all elements in the search list.
CodePudding user response:
The core logic here is lifted from S Rawson's answer.
Instead of a one-liner, I create some helper variables -- I make a series of product_id and all their associated search terms (product_keywords). Then I make a series (keepers) of all the product_id that have all the keywords. Finally I use the isin() method to test if dataframe rows have a product_id in that list.
This keeps all terms that are in groups that have all the sought terms; so if group 1 has "land" and "cruiser" and also other words, this gets all of the terms.
df = pd.DataFrame({"product_id": [1,1, 2, 3, 1, 2, 3],
"keywords": ["land", "barn", "land", "land", "cruiser", "barn", "cruiser"]})
search_list = set(['land', 'cruiser'])
product_keywords = df.groupby("product_id").apply(lambda x: list(set(x["keywords"])))
product_keywords
# product_id
# 1 [barn, cruiser, land]
# 2 [barn, land]
# 3 [cruiser, land]
keepers = product_keywords[product_keywords.apply(lambda found_terms: all(search_term in found_terms for search_term in search_list))]
keepers
# product_id
# 1 [barn, cruiser, land]
# 3 [cruiser, land]
df.loc[df['product_id'].isin(keepers.index)]
# product_id keywords
# 0 1 land
# 1 1 barn
# 3 3 land
# 4 1 cruiser
# 6 3 cruiser
CodePudding user response:
For a different answer (in one line!):
import pandas as pd
s = pd.DataFrame({'product_id': [1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4],
'keywords': ["land", "cruiser", "land", "cruiser", "land",
"land", "land", "cruiser", "cruiser", "land", "cruiser"]})
search_words = ['land', 'cruiser']
s[[(all(p in x for p in search_words)) \
for x in s.merge(s.groupby("product_id").apply(
lambda x: list(set(x["keywords"]))).to_frame(). \
rename(columns={0: "grouped"}), left_on="product_id",
right_index=True)["grouped"]]]
#Out:
# product_id keywords
#0 1 land
#1 1 cruiser
#2 1 land
#3 1 cruiser
#4 1 land
#9 4 land
#10 4 cruiser
EDIT
I have just used pandas groupby filtration, and realised it would be much more simple to use this here instead. The returned dataframe is exactly the same as the one above, just using a lot less code!
s.groupby("product_id").filter(lambda x: all(p in list(set(x["keywords"])) for p in search_words))