How to check if a given English sentence contains only meaningless words using Python?

I want to check in a Python program whether a given English sentence consists entirely of meaningless words.

Return True if every word in the sentence has no meaning,

e.g. sdfsdf sdf ssdf fsdf dsd sd

Return False if the sentence contains at least one word that has a meaning,

e.g. Hello asdf

Here is the code I wrote.

import nltk

nltk.download('words')

from nltk.corpus import words

def is_sentence_meaningless(sentence):
  is_meaningless = False
  for word in sentence.split():
    if(word in words.words()):
      is_meaningless = True
      break
  return is_meaningless    


print(is_sentence_meaningless("sss sss asdfasdf asdfasdfa asdfasfsd"))

print(is_sentence_meaningless(" sss sss asdfasdf asdfasdfa asdfasfsd TEST"))

Is there a better alternative to this code? Also, how can I add my own corpus to it? For example, I have a few domain-specific words that I want it to treat as meaningful words. Is that possible?

CodePudding user response：

You can use the set.difference method. Note that the words in nltk.corpus.words are mostly lower case, so the sentence has to be lower-cased with str.lower as well (e.g. "hello" is in the corpus but "Hello" isn't):

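def is_sentence_meaningless(sentence, domain_specific_words):
    s_set = set(sentence.lower().split())
    # difference() accepts several iterables; if nothing was removed,
    # no word in the sentence is a known word.
    if s_set.difference(words.words(), domain_specific_words) == s_set:
        return True
    return False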

Just FYI, your function does not do what your explanation says: it returns True as soon as it finds a word that is in the corpus.
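
For reference, the behavior described in the question (True only when no word is known) can also be written with set.isdisjoint. This is a minimal sketch, not the code from this answer: KNOWN_WORDS is a small stand-in for the NLTK word list so the example runs without downloading the corpus, and the domain-specific word is made up for illustration.

```python
# Stand-in vocabulary (assumption); with NLTK you would build it as
# {w.lower() for w in words.words()} instead.
KNOWN_WORDS = {"hello", "test", "world"}


def is_sentence_meaningless(sentence, domain_specific_words=()):
    """Return True only if no word in the sentence is a known word."""
    tokens = set(sentence.lower().split())
    vocabulary = KNOWN_WORDS | {w.lower() for w in domain_specific_words}
    # isdisjoint is True when the two sets share no element.
    return tokens.isdisjoint(vocabulary)


print(is_sentence_meaningless("sdfsdf sdf ssdf fsdf dsd sd"))
print(is_sentence_meaningless("Hello asdf"))
print(is_sentence_meaningless("kubectl asdf", ["kubectl"]))
```

Passing the domain-specific words as an argument keeps the base vocabulary untouched, so different callers can extend it differently.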

CodePudding user response：

Testing membership in a list scans the whole list, whereas a set lookup takes constant time on average, so the function can be made more efficient by converting the word list to a set.

Also, your logic doesn't seem to align with the implied purpose of the function (based on its name). A sentence would be meaningless if any of the words in the sentence are not found in the corpus set.

There is a considerable overhead in converting the word list to a set. Therefore, if the function is going to be used multiple times, it would be better to wrap it in a class.

Thus:

import nltk.corpus

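class sentence_checker:
    def __init__(self):
        # Build the set once; every later lookup reuses it.
        self.words = set(nltk.corpus.words.words())

    def is_sentence_meaningless(self, sentence):
        # True as soon as one word is missing from the corpus.
        for word in sentence.split():
            if word not in self.words:
                return True
        return False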

sc = sentence_checker()
print(sc.is_sentence_meaningless('hello'))
print(sc.is_sentence_meaningless('hellfffo'))
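
A possible variation on the class above, as a rough sketch: the vocabulary is passed in so that domain-specific words can be merged once at construction time, and tokens are lower-cased and stripped of surrounding punctuation before the lookup. The class name WordChecker, its method names, and the sample vocabulary are hypothetical; with NLTK the vocabulary argument would be nltk.corpus.words.words(). It exposes both readings discussed in this thread: every word unknown, or at least one word unknown.

```python
import string


class WordChecker:
    """Hypothetical helper: builds the lookup set once and reuses it."""

    def __init__(self, vocabulary, domain_specific_words=()):
        self.words = {w.lower() for w in vocabulary}
        self.words.update(w.lower() for w in domain_specific_words)

    def _tokens(self, sentence):
        # Lower-case and strip leading/trailing punctuation from each token.
        stripped = (t.strip(string.punctuation) for t in sentence.lower().split())
        return [t for t in stripped if t]

    def all_words_unknown(self, sentence):
        """True if none of the words is in the vocabulary."""
        return all(t not in self.words for t in self._tokens(sentence))

    def any_word_unknown(self, sentence):
        """True if at least one word is missing from the vocabulary."""
        return any(t not in self.words for t in self._tokens(sentence))


# Sample vocabulary and domain word are assumptions for illustration.
checker = WordChecker(["hello", "world"], domain_specific_words=["kubectl"])
print(checker.all_words_unknown("Hello, asdf"))
print(checker.any_word_unknown("Hello, asdf"))
print(checker.all_words_unknown("sdf ssdf"))
print(checker.any_word_unknown("hello kubectl"))
```

Which of the two methods to call depends on whether one unknown word should already make the sentence count as meaningless.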