Finding duplications in a List of Lists

I'm struggling with the following problem.

I've got a list of lists of an object having properties (let's say a name and a value).

I'd like to find and remove the duplicated lists, i.e., any list that is a part of another list; the order of the elements does matter.

Example in pseudocode (I'll present it as lists of strings to make it easier):

List<List<String>>  = [
["1", "2", "3", "4", "5", "6", "7", "8"], 
["3", "4", "5", "6", "7", "8"],
["5", "6", "7", "8"],
["7", "8"]
]

In this case, I'd like to remove the three shorter lists because they are contained in the longest list.

My classes can be described like this:

public static class MyObjectBig {
    private String startElem;
    private String endElem;
    private List<MyObjectOne> list;
    
    // constructors, getters, etc.
}

public static class MyObjectOne {
    private String name;
    private String value;

    // constructors, getters, etc.
}

The large lists are massive, around 21,000 elements, while the small lists are at most 20 elements, usually ~10.

I've got a couple of ideas, like creating a Map with the first item of each list as the key and the whole list as the value, then iterating over all items, checking whether an item exists in the map and, if it does, checking the following items. But that's very slow.

I'd appreciate any hints or ideas.

CodePudding user response:

Assuming that the equals/hashCode contract is properly implemented in MyObjectOne, you can define an auxiliary wrapper class that holds a reference to the original MyObjectBig instance and maintains a HashSet of map entries, generated from a map of the frequencies of the MyObjectOne elements in the original list (so that duplicated elements within a list are taken into account).

Here's how such a wrapper class might look:

import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

public static class BigObjectWrapper {
    private MyObjectBig bigObject;
    private Set<Map.Entry<MyObjectOne, Long>> set;
    private int size;
    private boolean isDuplicate;
    
    public BigObjectWrapper(MyObjectBig bigObject) {
        this.bigObject = bigObject;
        this.set = bigObject.getList().stream()
            .collect(Collectors.groupingBy(
                Function.identity(),
                Collectors.counting()
            ))
            .entrySet(); // we need a set because the Map interface doesn't offer a containsAll-style check against another map
        
        this.size = set.size();
    }
    
    public boolean contains(BigObjectWrapper other) {
        if (size < other.size) return false;
        
        return set.containsAll(other.set);
    }
    
    public MyObjectBig getBigObject() {
        return bigObject;
    }
    
    public boolean isDuplicate() {
        return isDuplicate;
    }
    
    public void setDuplicate() {
        isDuplicate = true;
    }
    
    // equals/hashCode implemented based on the size and set properties
}

Now, the algorithm can be implemented in the following steps:

  • Create a list of wrapped objects.
  • Compare each wrapper against the others. If a wrapper has already been proven to be a duplicate, it is skipped. Also, when a wrapper holding a smaller set is asked whether it contains a wrapper holding a larger set, the call terminates immediately, returning false (so there's no point in sorting the data).
  • Generate a new list of MyObjectBig from wrapper objects that were not proven to be duplicates.

Here's how the implementation might look:

List<MyObjectBig> source = List.of(); // your source data goes here

List<BigObjectWrapper> wrappers = source.stream()
    .map(BigObjectWrapper::new)
    .toList();

for (BigObjectWrapper wrapper : wrappers) {
    if (wrapper.isDuplicate()) continue;
    for (BigObjectWrapper next : wrappers) {
        // skip self-comparison (every wrapper contains itself) and known duplicates
        if (next == wrapper || next.isDuplicate()) continue;
        if (wrapper.contains(next)) next.setDuplicate();
    }
}

List<MyObjectBig> result = wrappers.stream()
    .filter(w -> !w.isDuplicate())
    .map(BigObjectWrapper::getBigObject)
    .toList();

CodePudding user response:

I would work with an algorithm like this:

  1. Order the lists by their length. You can do this by copying the inner lists into a new outer list that you sort with Collections.sort(List<T> newOuter, Comparator<? super T> c), with T being your inner List<String> and a custom Comparator that compares the lengths of the lists; see the sketch after this list. (String is used here only because your example uses it; the actual type doesn't matter.)

  2. Start with the shortest list and search for it, first in the longest list, then in the second longest, etc., until you can either remove the searched list or have searched all other lists. In the latter case you don't have to touch that shortest list; it will be kept. Repeat with the second shortest list, etc.
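
For step 1, here's a minimal sketch of the sort, assuming the inner lists are List<String> as in the question's example (original is a stand-in name for your outer list):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Copy the outer list so the original stays untouched, then sort by length.
List<List<String>> byLength = new ArrayList<>(original);
byLength.sort(Comparator.comparingInt(List::size)); // shortest inner list first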

To search a list in another list:

  1. Search for the index of the first element of the shorter list in the longer list.
  2. Once you've found it, and the rest of the longer list is long enough to contain the shorter list, grab a .subList from the longer list, starting at the found index and spanning the length of the shorter list.
  3. You can then compare these lists with .equals() to find if the shorter list is contained within the longer one. If your inner list elements don't support .equals() sufficiently for this comparison, you can write your own Comparator and use that to compare the two lists.

If no match was found in step 3, you can continue searching for the first element of the shorter list further along the longer list. As List.indexOf doesn't offer a parameter for the index at which to start the search, it's probably better to use a classic for-loop with an index for this. (That would also allow you to use a custom comparison if .equals doesn't work for you.)
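
Here's a minimal sketch of that search, again assuming List<String> inner lists (the name containsSubList is just for illustration):

import java.util.List;

// Returns true if 'shorter' occurs as a contiguous sub-list of 'longer',
// preserving order, using the indexOf-style scan described above.
static boolean containsSubList(List<String> longer, List<String> shorter) {
    if (shorter.isEmpty()) return true; // an empty list is trivially contained
    // Steps 1 and 2: try every index at which the shorter list could still fit.
    for (int i = 0; i <= longer.size() - shorter.size(); i++) {
        if (!longer.get(i).equals(shorter.get(0))) continue; // first element must match
        // Step 3: compare the candidate window with the shorter list.
        if (longer.subList(i, i + shorter.size()).equals(shorter)) {
            return true;
        }
    }
    return false;
}

Note that the JDK also ships Collections.indexOfSubList(longer, shorter), which performs essentially this equals-based scan and returns the starting index of the first match, or -1 if there is none.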

To remove a list that was found within another one, you have to take care to remove it from the original outer list, not from your sorted version.

After thinking a bit about it, I'm actually not so sure whether starting the search in the longest list will save you time on average, given the length of the searched lists. It probably depends a bit on the distribution of lengths.

Also I would like to warn you of premature optimization: first implement a working version, then use a profiler to find the places in your code where optimization will have the biggest impact.
