Python - Find words ending with "able" in a corpus

My task is to tokenize a corpus into words and then find the ones that end with "able". However, I get the following error:

>>> import nltk
>>> import re
>>> from nltk.corpus import gutenberg as guten
>>> guten_words = guten.words('austen-emma.txt')
>>> len(guten_words)
192427
>>> able_words = re.findall(r'able$',guten_words)
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    able_words = re.findall(r'able$',guten_words)
  File "C:\Program Files\Python37\lib\re.py", line 225, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
>>>

If I try to fix it by wrapping the argument in str(), like this:

able_words = re.findall(r'able$',str(guten_words))

... I get 0 results. What am I doing wrong?

CodePudding user response：

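You seem to be searching a list of strings rather than a single string, but re.findall expects a string (see https://docs.python.org/3/library/re.html#re.findall). When you cast a list to a string, you get its printed representation, something like this: ['word', 'word', …]. That text ends with a closing bracket rather than with a word, so able$ never matches.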
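Both failure modes can be reproduced without downloading the corpus. This is only a minimal sketch: `sample_words` is a short hand-written stand-in for the real token sequence, not actual Gutenberg data.

```python
import re

# Hypothetical stand-in for guten.words('austen-emma.txt'); not real corpus data.
sample_words = ["comfortable", "table", "Emma", "ably", "agreeable"]

# 1) Passing the sequence itself raises the TypeError from the question,
#    because re.findall needs a string (or bytes), not a list.
try:
    re.findall(r"able$", sample_words)
    raised = False
except TypeError:
    raised = True

# 2) str() produces the printed form of the list, which ends with "]".
as_text = str(sample_words)
last_char = as_text[-1]

# "$" anchors at the end of the whole string, where there is no "able".
matches_in_text = re.findall(r"able$", as_text)

print(raised)
print(last_char)
print(matches_in_text)
```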

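Instead, you could search in, for example, "\n".join(guten_words) if they're strings, passing re.MULTILINE so that $ matches at the end of every line and not only at the end of the whole text. Or just test each word on its own: found_words = [word for word in guten_words if re.search(r'able$', word)]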
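Here is a rough sketch of the two options, joining the words or testing them one by one. It assumes a small made-up list (`sample_words`) in place of the corpus, and the pattern `^.*able$` is my own choice so that the whole word is returned rather than just the "able" suffix.

```python
import re

# Hypothetical stand-in for the corpus tokens; not real corpus data.
sample_words = ["comfortable", "table", "Emma", "ably", "agreeable", "able"]

# Option A: put one word per line and search the joined text.
# re.MULTILINE makes "^" and "$" match at the start and end of every line.
joined = "\n".join(sample_words)
from_joined = re.findall(r"^.*able$", joined, flags=re.MULTILINE)

# Option B: test each word separately and keep the ones that match.
from_loop = [word for word in sample_words if re.search(r"able$", word)]

print(from_joined)
print(from_loop)
```

Option B avoids building one large string, which may matter for a corpus of roughly 190,000 tokens.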

CodePudding user response：

Try this:

list(filter(lambda x: re.findall(r'able$', x), guten_words))

guten_words is a list-like sequence of words, not a single string, which is why re.findall rejects it.

Try printing the type of guten_words:

print(type(guten_words))
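The filter idea can also be written with a precompiled pattern. This sketch uses a made-up `sample_words` list instead of the corpus. Two things here are my own additions rather than part of the answer above: the optional `re.IGNORECASE` flag, and the `str.endswith` variant, which swaps the regex for a plain string method.

```python
import re

# Hypothetical stand-in for the corpus tokens; not real corpus data.
sample_words = ["Comfortable", "table", "Emma", "TABLE", "ably", "table"]

# Compile once; IGNORECASE is an optional extra so "TABLE" also counts.
able_pattern = re.compile(r"able$", re.IGNORECASE)

# filter() keeps each word for which search() returns a match (truthy)
# and drops those for which it returns None.
with_filter = list(filter(able_pattern.search, sample_words))

# Alternative without regex: a plain suffix check on the lowercased word.
with_endswith = [word for word in sample_words if word.lower().endswith("able")]

# Collapse duplicates and case variants into a sorted list of distinct words.
unique_sorted = sorted({word.lower() for word in with_filter})

print(with_filter)
print(with_endswith)
print(unique_sorted)
```

For a fixed suffix like this one, the `endswith` version is arguably the simplest, and the regex form becomes more useful once the pattern gets more complicated.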