I am trying to find all the words in a column of strings that match a given word list. With pandas str.extract() I only get the first matched word. Since I need all the matched words, I thought str.extractall() would work, but I only get a ValueError. What could the problem be here? Thanks so much!
df['findWord'] = df['text'].str.extractall(f"({'|'.join(wordlist)})").fillna('')
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
CodePudding user response:
extract returns the first match. extractall generates one row per match.
For example, let's match A and the letter that follows it.
df = pd.DataFrame({'col': ['ABC', 'ADAE']})
# col
# 0 ABC
# 1 ADAE
df['col'].str.extractall('(A.)')
This creates a new index level named "match" that identifies the match number. Matches from the same original row share the same value in the first index level.
Output:
          0
  match
0 0      AB
1 0      AD
  1      AE
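To make the index structure concrete, here is a small sketch that inspects the result of the example above. The variable name `res` is only for illustration. It shows that the result carries a two-level index, which is a likely reason why assigning it straight to a column of the original frame does not work as hoped.

```python
import pandas as pd

df = pd.DataFrame({'col': ['ABC', 'ADAE']})
res = df['col'].str.extractall('(A.)')

# The first level is the original (unnamed) row index,
# the second level is the match number added by extractall.
print(res.index.names)

# One row per match, so the result can be longer than the original frame.
print(len(df), len(res))

# Select all matches that came from original row 1.
row1_matches = res.xs(1, level=0)[0].tolist()
print(row1_matches)
```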
With extract:
df['col'].str.extract('(A.)')
Output:
    0
0  AB
1  AD
Aggregating the output of extractall, here by match number (use level=0 instead to get one joined string per original row):
(df['col']
.str.extractall('(A.)')
.groupby(level='match').agg(','.join)
)
Output:
           0
match
0      AB,AD
1         AE
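Applied to the question, a minimal sketch could look like the following. The sample `text` values and the contents of `wordlist` are made up for illustration, and `re.escape` is added on the assumption that the words should be matched literally. Grouping on the first index level gives one joined string per original row, which then aligns with the original frame on assignment.

```python
import re
import pandas as pd

# Hypothetical data and word list, for illustration only.
df = pd.DataFrame({'text': ['apple and banana', 'no fruit here', 'cherry, apple']})
wordlist = ['apple', 'banana', 'cherry']

# Escape each word so regex metacharacters are matched literally.
pattern = f"({'|'.join(map(re.escape, wordlist))})"

# extractall returns one row per match, indexed by (original row, match number).
matches = df['text'].str.extractall(pattern)[0]

# Collapse to one comma-separated string per original row.
per_row = matches.groupby(level=0).agg(','.join)

# Assignment aligns on the original index; rows without a match become NaN,
# which is then replaced by an empty string.
df['findWord'] = per_row
df['findWord'] = df['findWord'].fillna('')

# Alternative: str.findall returns a list of matches per row directly.
df['findWord_alt'] = df['text'].str.findall(pattern).str.join(',')

print(df)
```

The str.findall variant avoids the extra index level altogether, so it may be the simpler choice when only the joined words are needed.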