I have a list of dictionaries d as follows:
d = [
    {"A": "ls", "B": "p", "C": "near a big lake"},
    {"A": "ls", "B": "p", "C": "lake"},
    {"A": "ls", "B": "p", "C": "near a lake"},
    {"A": "ls", "B": "q", "C": "fantastic"},
]
I am trying to simplify this list of dictionaries by removing the less verbose versions of the same triple. Two triples are deemed the same if they meet both of these criteria:
- the values of key B match
- all words from the value of C of one triple are fully contained within the value of C of the other triple
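For the example above, the expected result would presumably keep only the most verbose triple of each group:

[
    {"A": "ls", "B": "p", "C": "near a big lake"},
    {"A": "ls", "B": "q", "C": "fantastic"},
]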
CodePudding user response:
Since nobody answered this, I will.
There is, of course, a naive solution, which goes something like this untested code.
def remove_duplicates(d):
    d2 = []
    for i in range(len(d)):
        found = False
        for j in range(len(d)):
            if i != j and d[i]["B"] == d[j]["B"]:
                if all_words_included(d[i]["C"], d[j]["C"]):
                    # Drop d[i] if d[j] is more verbose; when the two
                    # are equally verbose, break the tie by index.
                    if len(d[i]["C"]) < len(d[j]["C"]) or i < j:
                        found = True
                        break
        if not found:
            d2.append(d[i])
    return d2
def sentence_to_word_freq(s):
    freq = {}
    for word in s.split(" "):
        if word not in freq:
            freq[word] = 1
        else:
            freq[word] += 1
    return freq
def all_words_included(s1, s2):
    # True if every word of s1 occurs in s2 at least as often.
    freq1 = sentence_to_word_freq(s1)
    freq2 = sentence_to_word_freq(s2)
    for word, freq in freq1.items():
        if word not in freq2 or freq2[word] < freq:
            return False
    return True
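With the return statement in place, a quick sanity check on the example data:

result = remove_duplicates(d)
print(result)
# [{'A': 'ls', 'B': 'p', 'C': 'near a big lake'},
#  {'A': 'ls', 'B': 'q', 'C': 'fantastic'}]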
But that involves O(n^2) checks. Is there a more efficient solution?
Well, yes, there is, but it is a lot harder to code. The basic idea is to build a trie data structure and use it to search for matches without having to visit items one by one. There are a number of such approaches, all of which are considerably more work. They are not worthwhile unless you will run this many, many times with a lot of data.
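For a rough idea of the direction, here is a minimal sketch. It simplifies the problem by treating each C value as a set of words (dropping the frequency counting used above), keeps one trie per value of B, and stores the sorted word set of every kept sentence in that trie. A new sentence is dropped if its word set is contained in a set already stored. All the names here (TrieNode, has_superset, remove_duplicates_trie) are made up for illustration:

class TrieNode:
    def __init__(self):
        self.children = {}   # word -> TrieNode
        self.terminal = False

def insert(root, words):
    # Store one sentence as its sorted set of unique words.
    node = root
    for w in sorted(set(words)):
        node = node.children.setdefault(w, TrieNode())
    node.terminal = True

def has_superset(node, words, i=0):
    # True if some stored word set contains all of words[i:].
    # words must be sorted and duplicate-free.
    if i == len(words):
        # Any stored set ending at or below this node is a superset.
        return node.terminal or bool(node.children)
    for w, child in node.children.items():
        if w == words[i] and has_superset(child, words, i + 1):
            return True
        if w < words[i] and has_superset(child, words, i):
            # w is an extra word in the stored set; skip over it.
            return True
    return False

def remove_duplicates_trie(d):
    kept = []
    tries = {}  # one trie per value of "B", since B must match
    # Process longer word sets first, so supersets are inserted
    # before their subsets are queried.
    for item in sorted(d, key=lambda t: -len(set(t["C"].split()))):
        root = tries.setdefault(item["B"], TrieNode())
        words = sorted(set(item["C"].split()))
        if not has_superset(root, words):
            insert(root, words)
            kept.append(item)
    return kept

Note that this processes sentences from most to fewest words, so the output order can differ from the input order, and the search is still not guaranteed sub-quadratic in the worst case; it just prunes whole groups of candidates at once instead of comparing every pair.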