import pandas as pd
wordsWeWant = ["ball", "bat", "ball-sports"]
words = [
"football, ball-sports, ball",
"ball, bat, ball, ball, ball, ballgame, football, ball-sports",
"soccer",
"football, basketball, roundball, ball" ]
df = pd.DataFrame({"WORDS":words})
df["WORDS_list"] = df["WORDS"].str.split(",")
Which results in a dataframe with a column full a string value that is always separated by a comma and no whitespaces between (can have hyphen, underscore, numbers, and other non-characters). Also, substrings can appear multiple times and also before or after a partial match (partials are not to be returned, only exact).
WORDS WORDS_list
football, ball-sports, ball ['football', ' ball-sports', ' ball']
ball, bat, ball, ball, ball, ballgame, football, ball-sports ['ball', ' bat', ' ball', ' ball', ' ball', ' ballgame', ' football', ' ball-sports']
soccer ['soccer']
football, basketball, roundball, ball ['football', ' basketball', ' roundball', ' ball']
(sorry for above, I can't figure out how to paste the output dataframe or how to paste from Excel)
What I want is a new column without duplicate matches. I tried using some regex but couldn't get it to work as expected. Next, I tried set operations using intersection but when I convert the column into a list (i.e. "WORDS_list") and then ran this
df["WORDS_list"].apply(lambda x: list(set(x).intersection(set(wordsWeWant))))
I ended up with unexpected output (see below:
0 []
1 [ball]
2 []
3 []
My real dataset can be quite large with multiple items to check in the string so want to avoid a nested for-loop of iterating the wordsWeWant over "WORDS" column and was thinking .map or .apply is the faster approach. It is okay if the returned column is a list, I manipulate it into a single string of comma and space separated words.
CodePudding user response:
Notice the split is ', '
df["WORDS_list"] = df["WORDS"].str.split(", ")
df["WORDS_list"].apply(lambda x: list(set(x).intersection(set(wordsWeWant))))
Out[242]:
0 [ball-sports, ball]
1 [bat, ball-sports, ball]
2 []
3 [ball]
Name: WORDS_list, dtype: object