Check if a column of strings contains a word from a list of strings and extract the matched words in Python

Time:02-16

I have a DataFrame and a list of keywords. How can I extract the matched words from the text column of the DataFrame? Can anyone help? Thank you!

**DataFrame**

df = pd.DataFrame({'ID':range(1,6), 'text':['red blue', 'bbb', 'rrrr blue', 'yyy b', 'ed yye']})


**Keyword list**

kword = ['red', 'rrrr']

I have tried the following:

keyword = r"keyword.csv"
kword = pd.read_csv(keyword , encoding_errors='ignore')
Wrd_list = kword.values.tolist()
pattern = '|'.join(str(v) for v in Wrd_list)

filename = r"text.csv"
data = pd.read_csv(filename, encoding_errors='ignore')
df = pd.DataFrame(data, columns=["id", "Text"])
df['Match_Word'] = df['Text'].str.extract(f"({'|'.join(pattern)})")

However, the output keeps only the first letter of each match (I also tried the extractall function, but it raised an error):

0  R
1 
2  R
3  
4 
5

My desired output is:

0 red
1 
2 rrrr
3
4
5

CodePudding user response:

You could use str.extract to extract the relevant keywords; then fill the NaNs with empty strings:

df['text'] = df['text'].str.extract(f"({'|'.join(kword)})").fillna('')

Output:

   ID  text
0   1   red
1   2      
2   3  rrrr
3   4      
4   5      
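A slightly more defensive variant of the same idea is sketched below. It assumes that keywords should match whole words only and may contain regex metacharacters; the `Match_Word` column name is borrowed from the question, and none of this is required for the sample data above.

```python
import re
import pandas as pd

df = pd.DataFrame({'ID': range(1, 6),
                   'text': ['red blue', 'bbb', 'rrrr blue', 'yyy b', 'ed yye']})
kword = ['red', 'rrrr']

# Escape each keyword so regex metacharacters are matched literally,
# and wrap the alternation in \b so that only whole words match.
pattern = r"\b(" + "|".join(re.escape(k) for k in kword) + r")\b"

# expand=False returns a Series for a single capture group;
# rows without a match are NaN, so fill them with empty strings.
df['Match_Word'] = df['text'].str.extract(pattern, expand=False).fillna('')
print(df)
```

Writing to a new column rather than overwriting `text` keeps the original strings available for later checks.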

CodePudding user response:

Your overall approach is fine. I think the issue is that you are building the wrong keyword pattern. Try passing header=None when reading the keyword CSV, so that the first keyword is not consumed as the column header.

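import pandas as pd

# Read the keywords; header=None keeps the first row as data rather than a header.
keyword = "np-match/keyword.csv"
kword = pd.read_csv(keyword, encoding_errors="ignore", header=None)
Wrd_list = kword.values.tolist()
pattern = "|".join(str(v) for v in Wrd_list)

# For this demonstration the keywords are hard-coded, replacing the pattern built above.
pattern = ["red", "rrr"]
filename = "np-match/text.csv"
data = pd.read_csv(filename, encoding_errors="ignore")
df = pd.DataFrame(data, columns=["id", "Text"])
# Join the keyword list into one alternation and capture the first match per row.
df["Match_Word"] = df["Text"].str.extract(f"({'|'.join(pattern)})")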


   id      Text Match_Word
0   1  red blue        red
1   2      bbbb        NaN
2   3  rrr blue        rrr
3   4      yyyy        NaN
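One likely reason for the single-letter output in the question is that `'|'.join(...)` was applied twice: joining an already-joined string splits it into individual characters. The sketch below illustrates this and builds the pattern once from a flat keyword list. The in-memory CSV contents are hypothetical stand-ins for keyword.csv and text.csv, and the `All_Matches` column is an illustrative addition for rows that may contain several keywords.

```python
import io
import re
import pandas as pd

# Hypothetical file contents standing in for keyword.csv and text.csv.
keyword_csv = io.StringIO("red\nrrrr\n")
text_csv = io.StringIO("id,Text\n1,red blue\n2,bbb\n3,rrrr blue\n4,yyy b\n5,ed yye\n")

# header=None keeps the first keyword as data instead of a column name.
kword = pd.read_csv(keyword_csv, header=None)
# Take the first column as a flat list of strings
# (kword.values.tolist() would give a list of one-element lists).
words = kword[0].astype(str).tolist()

joined = "|".join(words)    # 'red|rrrr'
broken = "|".join(joined)   # joining a string again separates every character
print(broken)               # r|e|d|||r|r|r|r

df = pd.read_csv(text_csv)
pattern = "(" + "|".join(re.escape(w) for w in words) + ")"
# First matching keyword per row (NaN where nothing matches).
df["Match_Word"] = df["Text"].str.extract(pattern, expand=False)
# Every matching keyword per row, as a list.
df["All_Matches"] = df["Text"].str.findall(pattern)
print(df)
```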