I am having a hard time comparing the strings in a DataFrame column with a list of strings.
Let me explain: I collected data from social media for a personal project, and separately I created a list of strings like the following:
the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']
There are other words but this is just to give you an idea.
My goal is to compare EACH word in this list with 2 existing DataFrame columns, which contain titles and post messages (from Reddit). To be clear, I want to create a new column that displays the words from my list that appear in the columns containing the posts.
So far, this is what I have done:
the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']
df['matched text'] = df.text_lemmatized.str.extract('({0})'.format('|'.join(the_list)), flags = re.IGNORECASE)
df = df[~pd.isna(df['matched text'])]
df
Output:
   title_lemmatized  text_lemmatized          matched_word(s)
0  Title1            'claim thorough vet...'  'ai'
1  Title2            'Yeaaah today iota...'   'IoT'
Here is the output result for more details.
The issue: The main problem is that it returns sequences of letters found inside other words (not whole words) that match the list.
Example:
--> the_list contains 'ai' (for artificial intelligence) or 'IoT' (for Internet of Things)
--> if df['text_lemmatized'] has the word 'claim' in the text, then 'ai' will be the match; likewise, 'Iota' will match 'IoT'.
What I wish:
   title_lemmatized  text_lemmatized                     matched_word(s)
0  Title1            'AI claim that Iot devises...'      'AI', 'IoT'
1  Title2            'The claim story about...'
2  Title3            'augmented reality and ai are...'   'augmented reality', 'ai'
3  Title4            'AI ai or artificial intelligence'  'AI', 'ai', 'artificial intelligence'
Thanks a lot :)
CodePudding user response:
You have to add word boundaries '\b' to your regex pattern. From the re module docs:
\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
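To see the effect on the kind of text from the question, here is a minimal sketch using only the standard re module. The two-term list and the sample sentence are made up for illustration.

```python
import re

terms = ["AI", "IoT"]
text = "claim that iota devices use AI and IoT"

# Without word boundaries, 'ai' inside 'claim' and 'iot' inside 'iota' also match.
loose = re.findall("|".join(terms), text, flags=re.IGNORECASE)

# With \b on both sides of the alternation, only whole words match.
strict = re.findall(r"\b(?:{})\b".format("|".join(terms)), text, flags=re.IGNORECASE)

print(loose)
print(strict)
```

The loose pattern picks up the fragments inside 'claim' and 'iota', while the bounded pattern keeps only the standalone 'AI' and 'IoT'.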
Besides that, you want to use Series.str.findall (or Series.str.extractall) instead of Series.str.extract, which returns only the first match per row, to find all the matches.
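As a rough comparison of the three methods, the sketch below runs them on a tiny made-up Series with a two-term pattern. The data and variable names are illustrative only.

```python
import re
import pandas as pd

s = pd.Series(["AI ai or NLP", "nothing here"])
pat = r"\b(AI|NLP)\b"

# extract: DataFrame with one column per group, first match per row only (NaN if none).
first_only = s.str.extract(pat, flags=re.IGNORECASE)

# findall: Series holding a list of every match for each row (empty list if none).
all_lists = s.str.findall(pat, flags=re.IGNORECASE)

# extractall: DataFrame with one row per match, indexed by (original row, match number).
all_rows = s.str.extractall(pat, flags=re.IGNORECASE)

print(first_only)
print(all_lists)
print(all_rows)
```

findall keeps one entry per original row, which makes it the easier fit when the result should become a new column of the same DataFrame.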
This should work:
the_list = ['AI', 'NLP', 'approach', 'AR Cloud', 'Army_Intelligence', 'Artificial general intelligence', 'Artificial tissue', 'artificial_insemination', 'artificial_intelligence', 'augmented intelligence', 'augmented reality', 'authentification', 'automaton', 'Autonomous driving', 'Autonomous vehicles', 'bidirectional brain-machine interfaces', 'Biodegradable', 'biodegradable', 'Biotech', 'biotech', 'biotechnology', 'BMI', 'BMIs', 'body_mass_index', 'bourdon', 'Bradypus_tridactylus', 'cognitive computing', 'commercial UAVs', 'Composite AI', 'connected home', 'conversational systems', 'conversational user interfaces', 'dawdler', 'Decentralized web', 'Deep fakes', 'Deep learning', 'defrayal']
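import re
# Escape each term so regex metacharacters are matched literally, then wrap the alternation in word boundaries.
pat = r'\b({0})\b'.format('|'.join(map(re.escape, the_list)))
# findall returns a list of all matches per row; join them into one comma-separated string.
df['matched text'] = df.text_lemmatized.str.findall(pat, flags=re.IGNORECASE).map(", ".join)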
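For a self-contained check, here is a sketch that applies the same idea to a small DataFrame modeled on the desired output in the question. It assumes 'IoT' is among the terms not shown in the list, uses a hypothetical helper named build_pattern, and searches the title and text columns together, since the question mentions both. Sorting the terms longest-first is an extra precaution, not something the approach above requires.

```python
import re
import pandas as pd

# Shortened term list; 'IoT' is assumed to be one of the terms omitted in the question.
the_list = ["AI", "IoT", "augmented reality", "artificial intelligence"]


def build_pattern(terms):
    """Hypothetical helper: whole-word alternation of escaped terms, longest first."""
    ordered = sorted(set(terms), key=len, reverse=True)
    return r"\b(?:{})\b".format("|".join(re.escape(t) for t in ordered))


# Made-up rows modeled on the question's desired output.
df = pd.DataFrame({
    "title_lemmatized": ["Title1", "Title2", "Title3", "Title4"],
    "text_lemmatized": [
        "AI claim that IoT devices",
        "The claim story about iota",
        "augmented reality and ai are",
        "AI ai or artificial intelligence",
    ],
})

pat = build_pattern(the_list)

# Concatenate title and text so a term in either column is reported.
combined = df["title_lemmatized"] + " " + df["text_lemmatized"]
df["matched text"] = combined.str.findall(pat, flags=re.IGNORECASE).map(", ".join)

print(df[["title_lemmatized", "matched text"]])
```

Rows with no match end up with an empty string rather than NaN, so filter with df[df['matched text'] != ''] if only matching posts are needed.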