Wordnet: Finding the most common hypernyms


The task I am trying to achieve is to find the top 20 most common hypernyms for all nouns and verbs in a text file. I believe my output is erroneous, and I suspect there is a more elegant solution, one that avoids manually creating a list of the most common nouns and verbs and hand-writing the code that iterates over the synsets to identify the hypernyms.

Please see below for the code I have attempted so far; any guidance would be appreciated:

nouns_verbs = [token.text for token in hamlet_spacy if (not token.is_stop and not token.is_punct and token.pos_ == "VERB" or token.pos_ == "NOUN")]

def check_hypernym(word_list):
    # Keep only the words that have at least one synset with a hypernym.
    return_list = []
    for word in word_list:
        for syn in wordnet.synsets(word):
            if syn.hypernyms():
                return_list.append(word)
                break
    return return_list

hypernyms = check_hypernym(nouns_verbs)
fd = nltk.FreqDist(hypernyms)
top_20 = fd.most_common(20)

word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']

hypernym_list = []
final_list = []  # initialised once, outside the loop, so results accumulate across words
for word in word_list:
    syn_list = wordnet.synsets(word)
    hypernym_list.append(syn_list)

    # Collect the hypernym lists of every synset for this word.
    for syn in syn_list:
        final_list.append(syn.hypernyms())

final_list

I tried identifying the top 20 most common nouns and verbs, then finding their synsets and, from those, their hypernyms. I would prefer a more cohesive solution, especially since I am unsure whether my current result is accurate.

CodePudding user response:

For the first part, getting all nouns and verbs from the text: you didn't provide the original text, so I couldn't reproduce this, but your condition can be both fixed and shortened. Note that and binds more tightly than or, so your condition actually groups as (not token.is_stop and not token.is_punct and token.pos_ == "VERB") or token.pos_ == "NOUN", which lets every noun bypass the stop-word and punctuation filters. You can also drop the punctuation check entirely, since a token tagged as a noun or verb is never punctuation, and use in so that you don't need two separate boolean conditions for NOUN and VERB.

nouns_verbs = [token.text for token in hamlet_spacy if not token.is_stop and token.pos_ in ["VERB", "NOUN"]]

Other than that it looks fine.
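If the precedence point isn't obvious, here is a minimal, self-contained sketch with toy values (nothing to do with spaCy) showing how Python groups the original condition:

# "and" binds more tightly than "or", so the stop-word and punctuation
# checks in the original condition only apply to verbs; any noun passes.
is_stop, is_punct, pos = True, False, "NOUN"  # a stop-word noun

original = not is_stop and not is_punct and pos == "VERB" or pos == "NOUN"
grouped = (not is_stop and not is_punct and pos == "VERB") or (pos == "NOUN")
fixed = not is_stop and pos in ["VERB", "NOUN"]

print(original, grouped, fixed)  # True True False

The first two agree, confirming the grouping; the fixed version correctly rejects the stop-word noun.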

For the second part, getting the most common hypernyms, your general approach is fine. For long texts, where the same hypernym can appear many times, you can make it a little more memory-efficient by using a Counter from the start instead of building one long list and feeding it to FreqDist. See the code below.

from nltk.corpus import wordnet as wn
from collections import Counter

word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']

hypernym_counts = Counter()
for word in word_list:
    for synset in wn.synsets(word):
        hypernym_counts.update(synset.hypernyms())

top_20_hypernyms = hypernym_counts.most_common(20)
for i, (hypernym, count) in enumerate(top_20_hypernyms, start=1):
    print(f"{i}. {hypernym.name()} ({count})")

Outputs:

1. time_period.n.01 (6)
2. be.v.01 (3)
3. communicate.v.02 (3)
4. male.n.02 (3)
5. think.v.03 (3)
6. male_aristocrat.n.01 (2)
7. letter.n.02 (2)
8. thyroid_hormone.n.01 (2)
9. experience.v.01 (2)
10. copulate.v.01 (2)
11. travel.v.01 (2)
12. time_unit.n.01 (2)
13. serve.n.01 (2)
14. induce.v.02 (2)
15. accept.v.03 (2)
16. make.v.02 (2)
17. leave.v.04 (2)
18. give.v.03 (2)
19. parent.n.01 (2)
20. make.v.03 (2)
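
Finally, if you want one cohesive pipeline instead of hand-copying the top-20 word list, something along the lines of the sketch below should work. I couldn't test it without your text; it assumes hamlet_spacy is your parsed spaCy Doc, and top_hypernyms is just a name I made up. Using token.lemma_ and restricting the WordNet lookup to the matching part of speech are optional refinements on my part.

from collections import Counter
from nltk.corpus import wordnet as wn

# Map spaCy coarse POS tags to WordNet POS constants so a token tagged
# VERB doesn't pull in unrelated noun synsets (and vice versa).
POS_MAP = {"NOUN": wn.NOUN, "VERB": wn.VERB}

def top_hypernyms(doc, n=20):
    # Tally the hypernyms of every non-stop noun and verb in the Doc.
    counts = Counter()
    for token in doc:
        if not token.is_stop and token.pos_ in POS_MAP:
            # token.lemma_ merges inflected forms ("Let"/"let", "comes"/"come").
            for synset in wn.synsets(token.lemma_.lower(), pos=POS_MAP[token.pos_]):
                counts.update(synset.hypernyms())
    return counts.most_common(n)

for i, (hypernym, count) in enumerate(top_hypernyms(hamlet_spacy), start=1):
    print(f"{i}. {hypernym.name()} ({count})")

Note that this counts hypernyms over all nouns and verbs in the text, which matches your stated task, rather than only over the 20 most frequent words.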