Home > other >  Python, return unique and exact match of substrings in a pandas dataframe column from a list of desi
Python, return unique and exact match of substrings in a pandas dataframe column from a list of desi

Time:04-08

import pandas as pd

wordsWeWant = ["ball", "bat", "ball-sports"]

words = [
"football, ball-sports, ball",
"ball, bat, ball, ball, ball, ballgame, football, ball-sports",
"soccer",
"football, basketball, roundball, ball" ]

df = pd.DataFrame({"WORDS":words})
df["WORDS_list"] = df["WORDS"].str.split(",")

Which results in a dataframe with a column full a string value that is always separated by a comma and no whitespaces between (can have hyphen, underscore, numbers, and other non-characters). Also, substrings can appear multiple times and also before or after a partial match (partials are not to be returned, only exact).

WORDS   WORDS_list
football, ball-sports, ball ['football', ' ball-sports', ' ball']
ball, bat, ball, ball, ball, ballgame, football, ball-sports    ['ball', ' bat', ' ball', ' ball', ' ball', ' ballgame', ' football', ' ball-sports']
soccer  ['soccer']
football, basketball, roundball, ball   ['football', ' basketball', ' roundball', ' ball']

(sorry for above, I can't figure out how to paste the output dataframe or how to paste from Excel)

What I want is a new column without duplicate matches. I tried using some regex but couldn't get it to work as expected. Next, I tried set operations using intersection but when I convert the column into a list (i.e. "WORDS_list") and then ran this

df["WORDS_list"].apply(lambda x: list(set(x).intersection(set(wordsWeWant))))

I ended up with unexpected output (see below:

0        []
1    [ball]
2        []
3        []

My real dataset can be quite large with multiple items to check in the string so want to avoid a nested for-loop of iterating the wordsWeWant over "WORDS" column and was thinking .map or .apply is the faster approach. It is okay if the returned column is a list, I manipulate it into a single string of comma and space separated words.

CodePudding user response:

Notice the split is ', '

df["WORDS_list"] = df["WORDS"].str.split(", ")
df["WORDS_list"].apply(lambda x: list(set(x).intersection(set(wordsWeWant))))
Out[242]: 
0         [ball-sports, ball]
1    [bat, ball-sports, ball]
2                          []
3                      [ball]
Name: WORDS_list, dtype: object
  • Related