Python - Find matching string(s) between DataFrame column (scraped text) and list of strings-CodePudding

I am having an hard time comparing the strings from a DataFrame column with a list of strings.

Let me explain to you: I collected data from social media for a personal project, and aside of that I created a list of string like the following:

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

There are other words but this is just to give you an idea.

My goal is to compare EACH of this list's words, with 2 existing DF columns which contains titles and posts messages (from reddit). To be clear, I want to create a new column where to display the words which match between my list to the columns containing the posts.

So far, this is what I have done:

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

df['matched text'] = df.text_lemmatized.str.extract('({0})'.format('|'.join(the_list)), flags = re.IGNORECASE)
df = df[~pd.isna(df['matched text'])]

df

>>Outpout:

      title_lemmatized   text_lemmatized        matched_word(s)
0         Title1       'claim thorough vet...'      'ai'
1         Title@       'Yeaaah today iota...'       'IoT'

Here the output result for more details.

The issue: The main problem is that its returning me letters (not actual words) that matches the list.

Example:

--> the_list = 'ai' (for artificial intelligence) or IoT (for Internet of Things)

--> df['text_lemmatized'] has the word 'claim' in the text, then 'ai' will be the match. or 'Iota' will match with 'IoT'.

What I wish:

   title_lemmatized       text_lemmatized             matched_word(s)
0    Title1         'AI claim that Iot devises...'      'AI', 'IoT'
1    Title2         'The claim story about...'
2    Title3         'augmented reality and ai are...'   'augmented reality', 'ai'
3    Title4         'AI ai or artificial intelligence'  'AI', 'ai', 'artificial intelligence'

Thanks lot :)

CodePudding user response：

You have to add word boundaries '\b' to your regex pattern. From the re module docs:

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

Besides that, you want to use Series.str.findall (or Series.str.extractall) instead of Series.str.extract to find all the matches.

This should work

the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']

pat = r'\b({0})\b'.format('|'.join(the_list))
df['matched text'] = df.text_lemmatized.str.findall(pat, flags = re.IGNORECASE).map(", ".join)