I am reading data from an excel file using apache poi and transforming it into a list of object. But now I want to extract any duplicates based on certain rules into another list of that object and also get the non-duplicate list.
Condition to check for a duplicate

name
email
phone number
gst number

Any of these properties can result in a duplicate. which mean or not an and

Party Class

public class Party {

    private String name;

    private Long number;

    private String email;

    private String address;

    private BigDecimal openingBalance;

    private LocalDateTime openingDate;

    private String gstNumber;
   // Getter Setter Skipped
}

Let's say this is the list returned by the logic to excel data so far

    var firstParty = new Party();
    firstParty.setName("Valid Party");
    firstParty.setAddress("Valid");
    firstParty.setEmail("Valid");
    firstParty.setGstNumber("Valid");
    firstParty.setNumber(1234567890L);
    firstParty.setOpeningBalance(BigDecimal.ZERO);
    firstParty.setOpeningDate(DateUtil.getDDMMDateFromString("01/01/2020"));

    var secondParty = new Party();
    secondParty.setName("Valid Party");
    secondParty.setAddress("Valid Address");
    secondParty.setEmail("Valid Email");
    secondParty.setGstNumber("Valid GST");
    secondParty.setNumber(7593612247L);
    secondParty.setOpeningBalance(BigDecimal.ZERO);
    secondParty.setOpeningDate(DateUtil.getDDMMDateFromString("01/01/2020"));

    var thirdParty = new Party();
    thirdParty.setName("Valid Party 1");
    thirdParty.setAddress("address");
    thirdParty.setEmail("email");
    thirdParty.setGstNumber("gst");
    thirdParty.setNumber(7593612888L);
    thirdParty.setOpeningBalance(BigDecimal.ZERO);
    secondParty.setOpeningDate(DateUtil.getDDMMDateFromString("01/01/2020"));

    var validParties = List.of(firstParty, secondParty, thirdParty);

What I have attempted so far :-


var partyNameOccurrenceMap = validParties.parallelStream()
        .map(Party::getName)
        .collect(Collectors.groupingBy(Function.identity(), HashMap::new, Collectors.counting()));

var partyNameOccurrenceMapCopy = SerializationUtils.clone(partyNameOccurrenceMap);

var duplicateParties = validParties.stream()
        .filter(party-> {
            var occurrence = partyNameOccurrenceMap.get(party.getName());
            if (occurrence > 1) {
                partyNameOccurrenceMap.put(party.getName(), occurrence - 1);
                return true;
            }
            return false;
        })
        .toList();
var nonDuplicateParties = validParties.stream()
        .filter(party -> {
            var occurrence = partyNameOccurrenceMapCopy.get(party.getName());
            if (occurrence > 1) {
                partyNameOccurrenceMapCopy.put(party.getName(), occurrence - 1);
                return false;
            }
            return true;
        })
        .toList();

The above code only checks for party name but we also need to check for email, phone number and gst number.

The code written above works just fine but the readability, conciseness and the performance might be an issue as the data set is large enough like 10k rows in excel file

CodePudding user response：

Never ignore Equals/hashCode contract

name, email, number, gstNumber

Any of these properties can result in a duplicate, which mean or

Your definition of a duplicate implies that any of these properties should match, whilst others might not.

It means that it's impossible to provide an implementation equals/hashCode that would match the given definition and doesn't violate the hashCode contract.

If two objects are equal according to the equals method, then calling the hashCode method on each of the two objects must produce the same integer result.

I.e. if you implement equals in such a way they any (not all) of these properties: name, email, number, gstNumber could match, and that would enough to consider the two objects equal, then there's no way to implement hashCode correctly.

And as the consequence of this, you can't use the object with a broken equals/hashCode implementation in with a hash-based Collection because equal objects might end up the in the different bucket (since they can produce different hashes). I.e. HashMap would not be able to recognize the duplicated keys, hence groupingBy with groupingBy() with Function.identity() as a classifier function would not work properly.

Therefore, to address this problem, you need to implement equals() based on all 4 properties: name, email, number, gstNumber (i.e. all these values have to be equal), and similarly all these values must contribute to hash-code.

How to determine Duplicates

There's no easy way to determine duplicates by multiple criteria. The solution you've provided is not viable, since we can't rely on the equals/hashCode.

The only way is to generate a HashMap separately for each end every attribute (i.e. in this case we need 4 maps). But can we alternate this, avoiding repeating the same steps for each map and hard coding the logic?

Yes, we can.

We can create a custom generic accumulation type (it would be suitable for any class - no hard-coded logic) that would encapsulate all the logic of determining duplicates and maintain an arbitrary number of maps under the hood. After consuming all the elements from the given collection, this custom object would be aware of all the duplicates in it.

That's how it can be implemented.

A custom accumulation type that would be used as container of a custom Collector. Its constructor expects varargs of functions, each function correspond to the property that should be taken into account while checking whether an object is a duplicate.

public static class DuplicateChecker<T> implements Consumer<T> {
    
    private List<DuplicateHandler<T>> handles;
    private Set<T> duplicates;

    @SafeVarargs
    public DuplicateChecker(Function<T, ?>... keyExtractors) {
        this.handles = Arrays.stream(keyExtractors)
            .map(DuplicateHandler::new)
            .toList();
    }

    @Override
    public void accept(T t) {
        handles.forEach(h -> h.accept(t));
    }
    
    public DuplicateChecker<T> merge(DuplicateChecker<T> other) {
        for (DuplicateHandler<T> handler: handles) {
            other.handles.forEach(handler::merge);
        }
        
        return this;
    }

    public DuplicateChecker<T> finish() {
        duplicates = handles.stream()
            .flatMap(handler -> handler.getDuplicates().stream())
            .flatMap(Set::stream)
            .collect(Collectors.toSet());
        
        return this;
    }
    
    public boolean isDuplicate(T t) {
        return duplicates.contains(t);
    }
}

A helper class representing a single createrion (like name, email, etc.) which encapsulates a HashMap. keyExtractor is used to obtain a key from an object of type T.

public static class DuplicateHandler<T> implements Consumer<T> {
    private Map<Object, Set<T>> itemByKey = new HashMap<>();
    private Function<T, ?> keyExtractor;

    public DuplicateHandler(Function<T, ?> keyExtractor) {
        this.keyExtractor = keyExtractor;
    }

    @Override
    public void accept(T t) {
        itemByKey.computeIfAbsent(keyExtractor.apply(t), k -> new HashSet<>()).add(t);
    }

    public void merge(DuplicateHandler<T> other) {
        other.itemByKey.forEach((k, v) ->
            itemByKey.merge(k,v,(oldV, newV) -> { oldV.addAll(newV); return oldV; }));
    }
    
    public Collection<Set<T>> getDuplicates() {
        Collection<Set<T>> duplicates = itemByKey.values();
        duplicates.removeIf(set -> set.size() == 1); // the object is proved to be unique by this particular property
        
        return duplicates;
    }
}

And that is the method, responsible for generating the map of duplicates, that would be used from the clean code. The given collection would be partitioned into two parts: one mapped to the key true - duplicates, another mapped to the key false - unique objects.

public static <T> Map<Boolean, List<T>> getPartitionByProperties(Collection<T> parties,
                                                                 Function<T, ?>... keyExtractors) {
    
    DuplicateChecker<T> duplicateChecker = parties.stream()
        .collect(Collector.of(
            () -> new DuplicateChecker<>(keyExtractors),
            DuplicateChecker::accept,
            DuplicateChecker::merge,
            DuplicateChecker::finish
        ));
    
    return parties.stream()
        .collect(Collectors.partitioningBy(duplicateChecker::isDuplicate));
}

And that how you can apply it for your particular case.

main()

public static void main(String[] args) {
    List<Party> parties = // initializing the list of parties
    
    Map<Boolean, List<Party>> isDuplicate = partitionByProperties(parties,
            Party::getName, Party::getNumber,
            Party::getEmail, Party::getGstNumber);
}

CodePudding user response：

I would use create a map for each property where

key is the property we want to check duplicate
value is a Set containing all the index of element in the list with same key.

Then we can

filter values in the map with more that 1 index (i.e. duplicate indexes).
union all the duplicate index
determine if the element is duplicate/unique by using the duplicate index.

The time complexity is roughly O(n).

public class UniquePerEachProperty {
    private static void separate(List<Party> partyList) {
        Map<String, Set<Integer>> nameToIndexesMap = new HashMap<>();
        Map<String, Set<Integer>> emailToIndexesMap = new HashMap<>();
        Map<Long, Set<Integer>> numberToIndexesMap = new HashMap<>();
        Map<String, Set<Integer>> gstNumberToIndexesMap = new HashMap<>();
        for (int i = 0; i < partyList.size(); i  ) {
            Party party = partyList.get(i);
            nameToIndexesMap.putIfAbsent(party.getName(), new HashSet<>());
            nameToIndexesMap.get(party.getName()).add(i);

            emailToIndexesMap.putIfAbsent(party.getEmail(), new HashSet<>());
            emailToIndexesMap.get(party.getEmail()).add(i);

            numberToIndexesMap.putIfAbsent(party.getNumber(), new HashSet<>());
            numberToIndexesMap.get(party.getNumber()).add(i);

            gstNumberToIndexesMap.putIfAbsent(party.getGstNumber(), new HashSet<>());
            gstNumberToIndexesMap.get(party.getGstNumber()).add(i);
        }
        Set<Integer> duplicatedIndexes = Stream.of(
                        nameToIndexesMap.values(),
                        emailToIndexesMap.values(),
                        numberToIndexesMap.values(),
                        gstNumberToIndexesMap.values()
                ).flatMap(Collection::stream).filter(indexes -> indexes.size() > 1)
                .flatMap(Set::stream).collect(Collectors.toSet());
        List<Party> duplicatedList = new ArrayList<>();
        List<Party> uniqueList = new ArrayList<>();
        for (int i = 0; i < partyList.size(); i  ) {
            Party party = partyList.get(i);
            if (duplicatedIndexes.contains(i)) {
                duplicatedList.add(party);
            } else {
                uniqueList.add(party);
            }
        }
        System.out.println("duplicated:"   duplicatedList);
        System.out.println("unique:"   uniqueList);
    }

    public static void main(String[] args) {
        separate(List.of(
                // name duplicate
                new Party("name1", 1L, "email1", "gstNumber1"),
                new Party("name1", 2L, "email2", "gstNumber2"),
                // number duplicate
                new Party("name3", 3L, "email3", "gstNumber3"),
                new Party("name4", 3L, "email4", "gstNumber4"),
                // email duplicate
                new Party("name5", 5L, "email5", "gstNumber5"),
                new Party("name6", 6L, "email5", "gstNumber6"),
                // gstNumber duplicate
                new Party("name7", 7L, "email7", "gstNumber7"),
                new Party("name8", 8L, "email8", "gstNumber7"),
                // unique
                new Party("name9", 9L, "email9", "gstNumber9")
        ));
    }
}

Assume Party has below constructor and toString()(for testing)

public class Party {
    public Party(String name, Long number, String email, String gstNumber) {
        this.name = name;
        this.number = number;
        this.email = email;
        this.address = "";
        this.openingBalance = BigDecimal.ZERO;
        this.openingDate = LocalDateTime.MIN;
        this.gstNumber = gstNumber;
    }

    @Override
    public String toString() {
        return "Party{"  
                "name='"   name   '\''  
                ", number="   number  
                ", email='"   email   '\''  
                ", gstNumber='"   gstNumber   '\''  
                '}';
    }
    ...
}