The task I am trying to achieve is finding the top 20 most common hypernyms for all nouns and verbs in a text file. I believe my output is erroneous and that there is a more elegant solution, particularly one that avoids manually creating a list of the most common nouns and verbs and hand-writing the code that iterates over the synsets to identify the hypernyms.

Please see below for the code I have attempted so far; any guidance would be appreciated:
nouns_verbs = [token.text for token in hamlet_spacy if (not token.is_stop and not token.is_punct and token.pos_ == "VERB" or token.pos_ == "NOUN")]
def check_hypernym(word_list):
    return_list = []
    for word in word_list:
        w = wordnet.synsets(word)
        for syn in w:
            if not ((len(syn.hypernyms())) == 0):
                return_list.append(word)
                break
    return return_list

hypernyms = check_hypernym(nouns_verbs)
fd = nltk.FreqDist(hypernyms)
top_20 = fd.most_common(20)
word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']

hypernym_list = []
for word in word_list:
    syn_list = wordnet.synsets(word)
    hypernym_list.append(syn_list)

final_list = []
for syn in syn_list:
    hypernyms_syn = syn.hypernyms()
    final_list.append(hypernyms_syn)

final_list
I tried identifying the top 20 most common nouns and verbs, then found their synsets and, from those, their hypernyms. I would prefer a more cohesive solution, especially since I am unsure whether my current result is accurate.
CodePudding user response:
For the first part, getting all nouns and verbs from the text: you didn't provide the original text, so I wasn't able to reproduce this, but you can shorten the condition, since it is given that a token that is a noun or verb is not punctuation. You can also use an "in" membership test so that you don't need two separate boolean conditions for "NOUN" and "VERB". As a bonus, this fixes a precedence bug in your original comprehension: and binds more tightly than or, so your filter admitted every noun, including stop words and punctuation.
nouns_verbs = [token.text for token in hamlet_spacy if not token.is_stop and token.pos_ in ["VERB", "NOUN"]]
Other than that it looks fine.
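If you want to see that precedence pitfall in isolation, here is a minimal self-contained snippet (not from your code, just an illustration with hard-coded values):

# "A and B or C" parses as "(A and B) or C", so with the original
# condition a stop-word noun still passes the filter.
is_stop, pos = True, "NOUN"
print(not is_stop and pos == "VERB" or pos == "NOUN")  # True: slips through
print(not is_stop and pos in ["VERB", "NOUN"])         # False: filtered out as intended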
For the second part of getting the most common hypernyms, your general approach is fine. You could make it a little more memory-efficient for long texts, where the same hypernym may appear many times, by using a Counter object from the get-go instead of constructing a long list. See the code below.
from nltk.corpus import wordnet as wn
from collections import Counter

word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']

# Update the Counter with each synset's hypernyms as we go,
# instead of building one long list and counting it afterwards.
hypernym_counts = Counter()
for word in word_list:
    for synset in wn.synsets(word):
        hypernym_counts.update(synset.hypernyms())

# most_common(20) returns the 20 highest counts, already sorted.
top_20_hypernyms = hypernym_counts.most_common(20)

for i, (hypernym, count) in enumerate(top_20_hypernyms, start=1):
    print(f"{i}. {hypernym.name()} ({count})")
Outputs:
1. time_period.n.01 (6)
2. be.v.01 (3)
3. communicate.v.02 (3)
4. male.n.02 (3)
5. think.v.03 (3)
6. male_aristocrat.n.01 (2)
7. letter.n.02 (2)
8. thyroid_hormone.n.01 (2)
9. experience.v.01 (2)
10. copulate.v.01 (2)
11. travel.v.01 (2)
12. time_unit.n.01 (2)
13. serve.n.01 (2)
14. induce.v.02 (2)
15. accept.v.03 (2)
16. make.v.02 (2)
17. leave.v.04 (2)
18. give.v.03 (2)
19. parent.n.01 (2)
20. make.v.03 (2)
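Finally, if you want one cohesive pipeline that also avoids hand-picking the top-20 word list, you can feed the spaCy tokens straight into the counter. Here is a minimal sketch, assuming hamlet_spacy is your existing spaCy Doc and the WordNet corpus has been downloaded (nltk.download('wordnet')):

from collections import Counter
from nltk.corpus import wordnet as wn

def top_hypernyms(doc, n=20):
    # Tally the hypernyms of every synset of each non-stop noun/verb token.
    counts = Counter()
    for token in doc:
        if token.is_stop or token.pos_ not in ("NOUN", "VERB"):
            continue
        for synset in wn.synsets(token.text):
            counts.update(synset.hypernyms())
    return counts.most_common(n)

for i, (hypernym, count) in enumerate(top_hypernyms(hamlet_spacy), start=1):
    print(f"{i}. {hypernym.name()} ({count})")

Note that this counts hypernyms over every token occurrence rather than over just the 20 most frequent words, which is arguably closer to "the most common hypernyms in the text". If you want your original two-step behaviour, take nltk.FreqDist(nouns_verbs).most_common(20) first and loop over those words instead.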