Algorithm to prune list of dictionaries


I have a list of dictionaries d as follows:

d = [
    {"A": "ls", "B": "p", "C": "near a big lake"},
    {"A": "ls", "B": "p", "C": "lake"},
    {"A": "ls", "B": "p", "C": "near a lake"},
    {"A": "ls", "B": "q", "C": "fantastic"},
]

I am trying to simplify this list of dictionaries by removing the less verbose versions of the same triple. Two triples are deemed the same if they meet both of the following criteria (a worked example follows the list):

  • the values of key B match
  • all words from the value of C for a given triple are fully contained within the value of C of the other triple.
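
Applied to the list above, this means "lake" and "near a lake" should be pruned: their B values match the first entry's, and every word they contain appears in "near a big lake". The last entry survives because its B value is "q". The expected result is:

d = [
    {"A": "ls", "B": "p", "C": "near a big lake"},
    {"A": "ls", "B": "q", "C": "fantastic"},
]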

CodePudding user response:

Since nobody answered this, I will.

There is, of course, a naive solution. It goes something like this:

def remove_duplicates(d):
    d2 = []
    for i in range(len(d)):
        found = False
        for j in range(len(d)):
            if i == j:
                continue
            # d[i] is redundant if d[j] has the same B value and
            # contains every word of d[i]["C"].
            if d[i]["B"] == d[j]["B"] and all_words_included(d[i]["C"], d[j]["C"]):
                # Prefer the longer sentence; break ties between exact
                # duplicates by keeping the one with the higher index.
                if len(d[i]["C"]) < len(d[j]["C"]) or i < j:
                    found = True
                    break
        if not found:
            d2.append(d[i])
    return d2

def sentence_to_word_freq(s):
    # Count how often each word occurs in the sentence.
    freq = {}
    for word in s.split():
        if word not in freq:
            freq[word] = 1
        else:
            freq[word] += 1
    return freq

def all_words_included(s1, s2):
    # True if every word of s1 occurs in s2 at least as many times.
    freq1 = sentence_to_word_freq(s1)
    freq2 = sentence_to_word_freq(s2)
    for word, freq in freq1.items():
        if word not in freq2 or freq2[word] < freq:
            return False
    return True
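
Running this on the sample list from the question gives:

pruned = remove_duplicates(d)
# pruned == [
#     {"A": "ls", "B": "p", "C": "near a big lake"},
#     {"A": "ls", "B": "q", "C": "fantastic"},
# ]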

But that involves O(n^2) checks. Is there a more efficient solution?

Well, yes, there is, but it is a lot harder to code. The basic idea is to build a trie data structure and use it to search for matches without having to visit the items one by one. There are a number of such approaches, all of which are considerably more work; they are not worthwhile unless you will run this many, many times over a lot of data.
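
To make the idea concrete, here is a minimal sketch of one trie-based variant. A few assumptions to note: it treats each value of C as a set of unique words (a simplification of the word-frequency check above), it builds one trie per B value, and the names find_cover and remove_duplicates_trie are mine, not anything standard.

from collections import defaultdict

END = object()  # sentinel key marking the end of a stored word set

def find_cover(node, ws, k, proper, self_idx):
    # Does some stored word set reachable from this node contain all of
    # ws[k:]? "proper" records that the path already used a word that is
    # not in ws, so any hit from here is a strict superset.
    if k == len(ws):
        if proper:
            return True  # every node lies on at least one stored set
        # An equal set from a later item counts, so that exact
        # duplicates keep only their last copy.
        if any(idx > self_idx for idx in node.get(END, ())):
            return True
        # Any longer set below this node is a strict superset.
        return any(key is not END for key in node)
    for word, child in node.items():
        if word is END:
            continue
        if word == ws[k]:
            if find_cover(child, ws, k + 1, proper, self_idx):
                return True
        elif word < ws[k]:
            # An extra word not in ws; ws[k] may still appear deeper.
            if find_cover(child, ws, k, True, self_idx):
                return True
        # word > ws[k]: paths store words in sorted order, so ws[k]
        # cannot appear anywhere below this branch, and we prune it.
    return False

def remove_duplicates_trie(d):
    groups = defaultdict(list)
    for i, item in enumerate(d):
        groups[item["B"]].append(i)
    keep = []
    for idxs in groups.values():
        word_sets = {i: tuple(sorted(set(d[i]["C"].split()))) for i in idxs}
        root = {}
        for i in idxs:  # insert every sentence in the group
            node = root
            for w in word_sets[i]:
                node = node.setdefault(w, {})
            node.setdefault(END, []).append(i)
        for i in idxs:  # keep items no other stored set covers
            if not find_cover(root, word_sets[i], 0, False, i):
                keep.append(i)
    return [d[i] for i in sorted(keep)]

The payoff is in find_cover: because each stored path keeps its words sorted, any branch whose next word is greater than the next query word can be pruned outright, so a query typically touches only a small part of the trie instead of comparing against every other item.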
