I have a DataFrame and a list of keywords. How can I extract the matched words from the text column of the DataFrame? Can anyone help? Thank you!
**DataFrame**
df = pd.DataFrame({'ID':range(1,6), 'text':['red blue', 'bbb', 'rrrr blue', 'yyy b', 'ed yye']})
**Keyword list**
kword = ['red', 'rrrr']
I have tried the following:
keyword = r"keyword.csv"
kword = pd.read_csv(keyword , encoding_errors='ignore')
Wrd_list = kword.values.tolist()
pattern = '|'.join(str(v) for v in Wrd_list)
filename = r"text.csv"
data = pd.read_csv(filename, encoding_errors='ignore')
df = pd.DataFrame(data, columns=["id", "Text"])
df['Match_Word'] = df['Text'].str.extract(f"({'|'.join(pattern)})")
However, the output kept only the first letter of each match. I also tried the extractall function, but it gave an error message. The str.extract output was:
0 R
1
2 R
3
4
5
My desired output should be:
0 red
1
2 rrrr
3
4
5
CodePudding user response:
You could use str.extract to extract the relevant keywords, then fill the NaNs with empty strings:
df['text'] = df['text'].str.extract(f"({'|'.join(kword)})").fillna('')
Output:
ID text
0 1 red
1 2
2 3 rrrr
3 4
4 5
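For reference, here is a self-contained sketch of the same idea using the sample data from the question. It differs from the one-liner above in two ways that are my own choices rather than part of the answer: each keyword goes through re.escape so that any regex metacharacters are matched literally, and the result is written to a new Match_Word column instead of overwriting text.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "ID": range(1, 6),
    "text": ["red blue", "bbb", "rrrr blue", "yyy b", "ed yye"],
})
kword = ["red", "rrrr"]

# Escape each keyword so regex metacharacters in it are matched literally,
# then join the alternatives into a single capture group.
pattern = "(" + "|".join(re.escape(k) for k in kword) + ")"

# expand=False returns a Series; rows without a match are NaN before fillna.
df["Match_Word"] = df["text"].str.extract(pattern, expand=False).fillna("")

print(df)
```

Note that str.extract returns only the first match in each row. If a row can contain several keywords, str.findall with the same pattern returns a list of all of them per row.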
CodePudding user response:
Your code works fine. I think your issue is that you are getting the wrong keyword pattern. Try adding header=None when reading the keyword CSV.
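import pandas as pd

keyword = "np-match/keyword.csv"
# header=None keeps the first row of the file as data instead of column names
kword = pd.read_csv(keyword, encoding_errors="ignore", header=None)
Wrd_list = kword.values.tolist()
pattern = "|".join(str(v) for v in Wrd_list)
# The joined string above is overwritten here by a hard-coded keyword list
pattern = ["red", "rrr"]
filename = "np-match/text.csv"
data = pd.read_csv(filename, encoding_errors="ignore")
df = pd.DataFrame(data, columns=["id", "Text"])
# pattern is a list at this point, so the join yields "red|rrr"
df["Match_Word"] = df["Text"].str.extract(f"({'|'.join(pattern)})")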
id Text Match_Word
0 1 red blue red
1 2 bbbb NaN
2 3 rrr blue rrr
3 4 yyyy NaN
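A note on why the question's version kept only single letters: there, pattern is already a joined string when it is passed to '|'.join(...) a second time, and joining a string puts "|" between every character. In addition, kword.values.tolist() on a DataFrame returns a nested list, so str(v) produces text like "['red']" rather than "red". The sketch below shows both effects and one possible way to build the pattern from the file. The in-memory CSV is a stand-in for keyword.csv and assumes one keyword per line with no header row.

```python
import io

import pandas as pd

# Stand-in for keyword.csv: one keyword per line, no header row (an assumption).
keyword_csv = io.StringIO("red\nrrrr\n")
kword = pd.read_csv(keyword_csv, header=None)

# Joining an already-joined string inserts "|" between every character,
# which is why only single letters were matched.
joined_once = "red|rrrr"
joined_twice = "|".join(joined_once)
print(joined_twice)

# Flatten the one-column frame to a plain list of strings and join it once.
word_list = kword[0].astype(str).tolist()
pattern = "|".join(word_list)

text = pd.Series(["red blue", "bbb", "rrrr blue", "yyy b", "ed yye"])
matches = text.str.extract(f"({pattern})", expand=False)
print(matches.tolist())
```

With header=None the single column is labelled 0, which is why kword[0] selects it. If the real file has a header row or several columns, the flattening step would need to be adjusted accordingly.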