How to find matched word only, no substring by using pandas str.extractall?-CodePudding

I am working with a column of string in a dataframe, and trying to extract all words which matched any words in a giving word list. It extracted all matched words and substring, how can I get only words? Many thanks!

My code:

import pandas as pd

cl =['dust', 'yes inr', 'inner']
data = [[1, 'dust industr yes inr'], [2, 'state inner'],[3, 'dustry']]
df = pd.DataFrame(data, columns = ['ID', 'Text'])

df['findWord'] = df['Text'].str.extractall(f"({'|'.join(cl)})").groupby(level=0).agg(', '.join)

print(df)

Current output: how can only extract the word dust, not substring of 'industry'

   ID                  Text             findWord
0   1  dust industr yes inr  dust, dust, yes inr
1   2           state inner                inner
2   3                dustry                 dust

Expected output:

   ID                  Text             findWord
0   1  dust industr yes inr        dust, yes inr
1   2           state inner                inner
2   3                dustry                  Nan

CodePudding user response：

Maybe something like this:

import pandas as pd
import numpy as np

cl =['dust', 'inner']
data = [[1, 'dust industry inner'], [2, 'state inner'],[3, 'dustry']]
df = pd.DataFrame(data, columns = ['ID', 'Text'])

df['findWord'] = [', '.join(set(d.split(' ')).intersection(set(cl))) for d in df['Text'].to_numpy()]
df = df.replace('', np.NaN)

   ID                 Text     findWord
0   1  dust industry inner  dust, inner
1   2          state inner        inner
2   3               dustry          NaN

Update 1: Try this with a regex pattern:

import pandas as pd

cl =['dust', 'yes inr', 'inner']
data = [[1, 'dust industr yes inr'], [2, 'state inner'],[3, 'dustry']]
df = pd.DataFrame(data, columns = ['ID', 'Text'])

regex = '({})'.format('|'.join('\\b{}\\b'.format(c) for c in cl))
df['findWord'] = df['Text'].str.extractall(regex).groupby(level=0).agg(', '.join)

   ID                  Text       findWord
0   1  dust industr yes inr  dust, yes inr
1   2           state inner          inner
2   3                dustry            NaN

CodePudding user response：

Fix your regex pattern by adding word boundries \b so that it only matches full words then use str.findall to find all occurrences of this pattern

df['findWord'] = df['Text'].str.findall(r'\b(%s)\b' % '|'.join(cl)).str.join(', ')

   ID                  Text       findWord
0   1  dust industr yes inr  dust, yes inr
1   2           state inner          inner
2   3                dustry