My task is to tokenize a corpus into words and then find ones that end with "able". However, this error occurs.
>>> import nltk
>>> import re
>>> from nltk.corpus import gutenberg as guten
>>> guten_words = guten.words('austen-emma.txt')
>>> len(guten_words)
192427
>>> able_words = re.findall(r'able$',guten_words)
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    able_words = re.findall(r'able$',guten_words)
  File "C:\Program Files\Python37\lib\re.py", line 225, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
>>>
If I wrap it in "str" to work around the error, like this:
able_words = re.findall(r'able$',str(guten_words))
... I get 0 results. What am I doing wrong?
CodePudding user response:
You seem to be searching a list of strings rather than a single string, which is what re.findall expects (https://docs.python.org/3/library/re.html#re.findall). When you cast a list to a string, you get its printed representation, something like ['word1', 'word2', …], which ends with "]", so able$ never matches.
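You should search in a single string instead, for example "\n".join(guten_words) if the items are strings. In that case pass flags=re.MULTILINE so that $ matches at the end of each line rather than only at the end of the whole string.
Or search each word separately and concatenate the per-word results: found_words = sum([re.findall(r'\w*able$', word) for word in guten_words], [])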
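To make the difference concrete, here is a minimal sketch comparing the three approaches. It uses a small hypothetical list, sample_words, as a stand-in for guten.words('austen-emma.txt') so that it runs without downloading the corpus. The pattern ^\w*able$ is my own choice so that whole words are returned, not just the "able" suffix.

```python
import re

# Hypothetical stand-in for guten.words('austen-emma.txt'):
# any sequence of strings behaves the same way here.
sample_words = ['Emma', 'was', 'comfortable', 'and', 'agreeable', ',', 'able', 'tables']

# str() of a list is its printed representation, which ends with "]",
# so a pattern anchored with '$' cannot match at the end.
as_repr = str(sample_words)
repr_matches = re.findall(r'able$', as_repr)

# Joining with newlines puts one word per line; re.MULTILINE makes
# '^' and '$' match at the start and end of every line.
joined = "\n".join(sample_words)
joined_matches = re.findall(r'^\w*able$', joined, flags=re.MULTILINE)

# Testing each word on its own avoids building one large string.
per_word_matches = [w for w in sample_words if re.search(r'able$', w)]

print(repr_matches)
print(joined_matches)
print(per_word_matches)
```

With the real corpus, replacing sample_words with guten_words should work the same way, assuming every item is a string.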
CodePudding user response:
Try this:
list(filter(lambda x: re.findall(r'able$', x), guten_words))
guten_words is a list (strictly, a list-like NLTK corpus view) of strings, not a single string. Try printing its type:
print(type(guten_words))
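A filter-based version could look like the sketch below. It again uses a hypothetical sample_words list in place of guten_words, and precompiles the pattern so that its search method can be passed straight to filter. The str.endswith variant and the lowercased, de-duplicated list are optional extras I added, not part of the original suggestion.

```python
import re

# Hypothetical stand-in for guten_words (a sequence of token strings).
sample_words = ['Comfortable', 'table', ',', 'agreeable', 'tables', 'table', 'cable']

# A plain list here; with NLTK this would print a corpus view class instead.
print(type(sample_words))

# Compile once; pattern.search returns a match object (truthy) or None (falsy),
# so it can serve directly as the predicate for filter().
pattern = re.compile(r'able$')
able_words = list(filter(pattern.search, sample_words))

# For a fixed suffix, str.endswith gives the same result without a regex.
able_words_plain = [w for w in sample_words if w.endswith('able')]

# Optional: collapse case variants and repeats into a sorted vocabulary.
unique_able_words = sorted({w.lower() for w in able_words})

print(able_words)
print(unique_able_words)
```

The regex form is worth keeping if the pattern may grow more complex later. For a plain suffix check, endswith is simpler to read.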