Home > Software engineering >  How to find matched word only, no substring by using pandas str.extractall?
How to find matched word only, no substring by using pandas str.extractall?

Time:02-18

I am working with a column of string in a dataframe, and trying to extract all words which matched any words in a giving word list. It extracted all matched words and substring, how can I get only words? Many thanks!

My code:

import pandas as pd

cl =['dust', 'yes inr', 'inner']
data = [[1, 'dust industr yes inr'], [2, 'state inner'],[3, 'dustry']]
df = pd.DataFrame(data, columns = ['ID', 'Text'])

df['findWord'] = df['Text'].str.extractall(f"({'|'.join(cl)})").groupby(level=0).agg(', '.join)

print(df)

Current output: how can only extract the word dust, not substring of 'industry'

   ID                  Text             findWord
0   1  dust industr yes inr  dust, dust, yes inr
1   2           state inner                inner
2   3                dustry                 dust

Expected output:

   ID                  Text             findWord
0   1  dust industr yes inr        dust, yes inr
1   2           state inner                inner
2   3                dustry                  Nan

CodePudding user response:

Maybe something like this:

import pandas as pd
import numpy as np

cl =['dust', 'inner']
data = [[1, 'dust industry inner'], [2, 'state inner'],[3, 'dustry']]
df = pd.DataFrame(data, columns = ['ID', 'Text'])

df['findWord'] = [', '.join(set(d.split(' ')).intersection(set(cl))) for d in df['Text'].to_numpy()]
df = df.replace('', np.NaN)
   ID                 Text     findWord
0   1  dust industry inner  dust, inner
1   2          state inner        inner
2   3               dustry          NaN

Update 1: Try this with a regex pattern:

import pandas as pd

cl =['dust', 'yes inr', 'inner']
data = [[1, 'dust industr yes inr'], [2, 'state inner'],[3, 'dustry']]
df = pd.DataFrame(data, columns = ['ID', 'Text'])

regex = '({})'.format('|'.join('\\b{}\\b'.format(c) for c in cl))
df['findWord'] = df['Text'].str.extractall(regex).groupby(level=0).agg(', '.join)
   ID                  Text       findWord
0   1  dust industr yes inr  dust, yes inr
1   2           state inner          inner
2   3                dustry            NaN

CodePudding user response:

Fix your regex pattern by adding word boundries \b so that it only matches full words then use str.findall to find all occurrences of this pattern

df['findWord'] = df['Text'].str.findall(r'\b(%s)\b' % '|'.join(cl)).str.join(', ')

   ID                  Text       findWord
0   1  dust industr yes inr  dust, yes inr
1   2           state inner          inner
2   3                dustry               
  • Related