I would like to know how you would find all the variations of a word, or the words that are related or very similar to the original word, in Python.
An example of the sort of thing I am looking for is like this:
word = "summary" # any word
word_variations = find_variations_of_word(word) # a function that finds all the variations of a word; this is what I want to know how to write
print(word_variations)
# What it should print out: ["summaries", "summarize", "summarizing", "summarized"]
This is just an example of what the code should do. I have seen other similar questions on this topic, but none of them were accurate enough. I found some code and altered it for my case, which kind of works, but not the way I would like it to.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet', quiet=True)  # make sure the WordNet data is available
lemmatizer = WordNetLemmatizer()
def find_inflections(word):
    inflections = []
    for synset in wordnet.synsets(word):  # Find all synsets for the word
        for lemma in synset.lemmas():  # Find all lemmas for each synset
            inflected_form = lemma.name().replace("_", " ")  # Get the inflected form of the lemma
            if inflected_form != word:  # Only add the inflected form if it's different from the original word
                inflections.append(inflected_form)
    return inflections
word = "summary"
inflections = find_inflections(word)
print(inflections)
# Output: ['sum-up', 'drumhead', 'compendious', 'compact', 'succinct']
# What the Output should be: ["summaries", "summarize", "summarizing", "summarized"]
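Note for anyone answering: WordNet lemmas also have a derivationally_related_forms() method, which looks closer to what I want than the synset lemmas above, but as far as I can tell it still doesn't return plain inflections like "summaries". A rough sketch of that approach (the printed output is approximate):
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet', quiet=True)
def derivational_variants(word):
    variants = set()
    for lemma in wordnet.lemmas(word):  # lemmas spelled exactly like the word
        for related in lemma.derivationally_related_forms():  # morphologically related lemmas
            variants.add(related.name().replace("_", " "))
    return sorted(variants)
print(derivational_variants("summary"))
# Something like: ['summarise', 'summarize'] -- derivational variants, still no inflected forms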
CodePudding user response:
This probably isn't of any use to you, but may help someone else who finds this with a search -
If the aim is just to find the words, rather than specifically to use a machine-learning approach to the problem, you could try using a regular expression (regex).
W3Schools seems to cover enough to get the result you want here, or there is a more technical overview on python.org.
To search case-insensitively for the specific words you listed, the following would work:
import re
string = "A SUMMARY ON SUMMATION:" \
"We use summaries to summarize. This action is summarizing. " \
"Once the action is complete things have been summarized."
occurrences = re.findall("summ[a-zA-Z]*", string, re.IGNORECASE)
print(occurrences)
However, depending on your precise needs you may need to modify the regular expression as this would also find words like 'summer' and 'summon'.
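For example, anchoring on a longer stem (assuming "summar" really is the stem you care about) rules those out while still catching the inflections:
import re
string = "A SUMMARY ON SUMMATION: " \
         "We use summaries to summarize. This action is summarizing. " \
         "Once the action is complete things have been summarized. It rained all summer."
# "summar" is long enough to exclude "summer", "summon" and "summation"
occurrences = re.findall(r"\bsummar[a-zA-Z]*", string, re.IGNORECASE)
print(occurrences)
# Something like: ['SUMMARY', 'summaries', 'summarize', 'summarizing', 'summarized']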
I'm not very good at regex but they can be a powerful tool if you know precisely what you are looking for and spend a little time crafting the right expression.
Sorry this probably isn't relevant to your circumstance but good luck.
CodePudding user response:
With the help of Paul Williams, I have come up with an answer that gets almost all of the variations of the word right about 90% of the time. I first check whether one word is contained in the other, like "topic" and "topics"; if so, I just append it to similar_list. Otherwise, I use a pre-trained Hugging Face Transformers model (BERT) to generate embeddings for the two words and then use cosine similarity to measure how similar they are. Because this is AI it doesn't get every word right, but it is accurate enough for my use case.
import re
import torch
from transformers import BertTokenizer, BertModel
string = "A summarized book SUMMARY ON summarization:" \
"We use summaries to summon summarized version of a book this summer, supper. This action is summarizing AKA summarize. " \
"Once the action is complete things have been summarized."
topic = 'summary'
occurrences = re.findall(f"{topic[:2]}[a-zA-Z]*", string, re.IGNORECASE)
# Set up the BERT model and tokenizer
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
similar_list = []
def is_match(word1, word2):
    # Check if word1 is a substring of word2
    if word1 in word2:
        return True
    # Check if word2 is a substring of word1
    elif word2 in word1:
        return True
    else:
        return False
for word in occurrences:
    word = word.lower()
    topic = topic.lower()
    if is_match(topic, word):
        similar_list.append(word)
    else:
        # Tokenize the word pair and convert it to a tensor
        input_ids = torch.tensor([tokenizer.encode(word, topic)])
        # Generate the word embeddings
        with torch.no_grad():
            outputs = model(input_ids)
            embeddings = outputs[0]
        # Get the word embeddings for each word
        word_embedding = embeddings[0][0]
        topic_embedding = embeddings[0][1]
        # Calculate the cosine similarity
        cosine_similarity = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
        similarity = cosine_similarity(word_embedding, topic_embedding)
        # Append the word if it is similar enough
        if similarity > 0.3:
            similar_list.append(word)
        else:
            # Don't do anything
            pass
print(similar_list)
# output: ['summary', 'summarization', 'summaries', 'summarizing', 'summarize']
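One caveat with the embedding step above: with tokenizer.encode(word, topic) the first position is the [CLS] token, so embeddings[0][0] and embeddings[0][1] are really the [CLS] vector and the first word-piece of word, not two clean word vectors. If that matters for your use case, a variant that embeds each word separately and mean-pools its word-piece vectors might be more robust (just a sketch, using the same bert-base-uncased model as above):
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
def embed(word):
    # Encode the word on its own and mean-pool its word-piece vectors,
    # dropping the [CLS] and [SEP] positions
    inputs = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs)[0][0]  # (num_tokens, 768)
    return hidden[1:-1].mean(dim=0)
cos = torch.nn.CosineSimilarity(dim=0)
print(float(cos(embed("summary"), embed("summarize"))))  # a similarity score you can threshold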
I hope this helps anyone else who wants to do a similar check.