I would like to know how you would find all the variations of a word, or the words that are related or very similar to the original word, in Python.
An example of the sort of thing I am looking for is like this:
word = "summary" # any word
word_variations = find_variations_of_word(word) # a function that finds all the variations of a word; this is what I want to know how to write
print(word_variations)
# What it should print out: ["summaries", "summarize", "summarizing", "summarized"]
This is just an example of what the code should do. I have seen other similar questions on this topic, but none of them were accurate enough. I found some code and altered it for my case, which kind of works, but not the way I would like it to.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet', quiet=True)  # make sure the WordNet data is available
lemmatizer = WordNetLemmatizer()
def find_inflections(word):
    inflections = []
    for synset in wordnet.synsets(word):  # Find all synsets for the word
        for lemma in synset.lemmas():  # Find all lemmas for each synset
            inflected_form = lemma.name().replace("_", " ")  # Get the inflected form of the lemma
            if inflected_form != word:  # Only add the inflected form if it's different from the original word
                inflections.append(inflected_form)
    return inflections
word = "summary"
inflections = find_inflections(word)
print(inflections)
# Output: ['sum-up', 'drumhead', 'compendious', 'compact', 'succinct']
# What the Output should be: ["summaries", "summarize", "summarizing", "summarized"]
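Note for anyone answering: WordNet lemmas also have a derivationally_related_forms() method, which looks closer to what I want than the synset lemmas above, but as far as I can tell it still doesn't return plain inflections like "summaries". A rough sketch of that approach (the printed output is approximate):
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet', quiet=True)
def derivational_variants(word):
    variants = set()
    for lemma in wordnet.lemmas(word):  # lemmas spelled exactly like the word
        for related in lemma.derivationally_related_forms():  # morphologically related lemmas
            variants.add(related.name().replace("_", " "))
    return sorted(variants)
print(derivational_variants("summary"))
# Something like: ['summarise', 'summarize'] -- derivational variants, still no inflected forms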
CodePudding user response:
This probably isn't of any use to you, but may help someone else who finds this with a search -
If the aim is just to find the words, rather than specifically to use a machine-learning approach to the problem, you could try using a regular expression (regex).
W3Schools seems to cover enough to get the result you want here, or there is a more technical overview on python.org.
To search case-insensitively for the specific words you listed, the following would work:
import re
string = "A SUMMARY ON SUMMATION:" \
"We use summaries to summarize. This action is summarizing. " \
"Once the action is complete things have been summarized."
occurrences = re.findall("summ[a-zA-Z]*", string, re.IGNORECASE)
print(occurrences)
However, depending on your precise needs you may need to modify the regular expression as this would also find words like 'summer' and 'summon'.
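For example, anchoring on a longer stem (assuming "summar" really is the stem you care about) rules those out while still catching the inflections:
import re
string = "A SUMMARY ON SUMMATION: " \
         "We use summaries to summarize. This action is summarizing. " \
         "Once the action is complete things have been summarized. It rained all summer."
# "summar" is long enough to exclude "summer", "summon" and "summation"
occurrences = re.findall(r"\bsummar[a-zA-Z]*", string, re.IGNORECASE)
print(occurrences)
# Something like: ['SUMMARY', 'summaries', 'summarize', 'summarizing', 'summarized']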
I'm not very good at regex but they can be a powerful tool if you know precisely what you are looking for and spend a little time crafting the right expression.
Sorry this probably isn't relevant to your circumstance but good luck.
CodePudding user response:
With the help of Paul Williams, I have come up with an answer that gets almost all of the variations of the word right about 90% of the time. I first check whether one word is contained in the other, like "topic" and "topics"; if so, I just append it to similar_list. Otherwise, I use a pre-trained Hugging Face Transformers model (BERT) to generate embeddings for the two words and then use cosine similarity to measure how similar they are. Because this is AI it doesn't get every word right, but it is accurate enough for my use case.
import re
import torch
from transformers import BertTokenizer, BertModel
string = "A summarized book SUMMARY ON summarization:" \
"We use summaries to summon summarized version of a book this summer, supper. This action is summarizing AKA summarize. " \
"Once the action is complete things have been summarized."
topic = 'summary'
occurrences = re.findall(f"{topic[:2]}[a-zA-Z]*", string, re.IGNORECASE)
# Set up the BERT model and tokenizer
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
similar_list = []
def is_match(word1, word2):
    # Check if word1 is a substring of word2
    if word1 in word2:
        return True
    # Check if word2 is a substring of word1
    elif word2 in word1:
        return True
    else:
        return False
for word in occurrences:
    word = word.lower()
    topic = topic.lower()
    if is_match(topic, word):
        similar_list.append(word)
    else:
        # Tokenize the word pair and convert it to a tensor
        input_ids = torch.tensor([tokenizer.encode(word, topic)])
        # Generate the word embeddings
        with torch.no_grad():
            outputs = model(input_ids)
            embeddings = outputs[0]
        # Get the word embeddings for each word
        word_embedding = embeddings[0][0]
        topic_embedding = embeddings[0][1]
        # Calculate the cosine similarity
        cosine_similarity = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
        similarity = cosine_similarity(word_embedding, topic_embedding)
        # Append the word if it is similar enough
        if similarity > 0.3:
            similar_list.append(word)
        else:
            # Don't do anything
            pass
print(similar_list)
# output: ['summary', 'summarization', 'summaries', 'summarizing', 'summarize']
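One caveat with the embedding step above: with tokenizer.encode(word, topic) the first position is the [CLS] token, so embeddings[0][0] and embeddings[0][1] are really the [CLS] vector and the first word-piece of word, not two clean word vectors. If that matters for your use case, a variant that embeds each word separately and mean-pools its word-piece vectors might be more robust (just a sketch, using the same bert-base-uncased model as above):
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
def embed(word):
    # Encode the word on its own and mean-pool its word-piece vectors,
    # dropping the [CLS] and [SEP] positions
    inputs = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs)[0][0]  # (num_tokens, 768)
    return hidden[1:-1].mean(dim=0)
cos = torch.nn.CosineSimilarity(dim=0)
print(float(cos(embed("summary"), embed("summarize"))))  # a similarity score you can threshold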
I hope this helps anyone else who wants to do a similar check.