I am reading data from an excel file using apache poi and transforming it into a list of object. But now I want to extract any duplicates based on certain rules into another list of that object and also get the non-duplicate list.
Condition to check for a duplicate
- name
- phone number
- gst number
Any of these properties can result in a duplicate. which mean or not an and
Party Class
public class Party {
private String name;
private Long number;
private String email;
private String address;
private BigDecimal openingBalance;
private LocalDateTime openingDate;
private String gstNumber;
// Getter Setter Skipped
}
Let's say this is the list returned by the logic to excel data so far
var firstParty = new Party();
firstParty.setName("Valid Party");
firstParty.setAddress("Valid");
firstParty.setEmail("Valid");
firstParty.setGstNumber("Valid");
firstParty.setNumber(1234567890L);
firstParty.setOpeningBalance(BigDecimal.ZERO);
firstParty.setOpeningDate(DateUtil.getDDMMDateFromString("01/01/2020"));
var secondParty = new Party();
secondParty.setName("Valid Party");
secondParty.setAddress("Valid Address");
secondParty.setEmail("Valid Email");
secondParty.setGstNumber("Valid GST");
secondParty.setNumber(7593612247L);
secondParty.setOpeningBalance(BigDecimal.ZERO);
secondParty.setOpeningDate(DateUtil.getDDMMDateFromString("01/01/2020"));
var thirdParty = new Party();
thirdParty.setName("Valid Party 1");
thirdParty.setAddress("address");
thirdParty.setEmail("email");
thirdParty.setGstNumber("gst");
thirdParty.setNumber(7593612888L);
thirdParty.setOpeningBalance(BigDecimal.ZERO);
secondParty.setOpeningDate(DateUtil.getDDMMDateFromString("01/01/2020"));
var validParties = List.of(firstParty, secondParty, thirdParty);
What I have attempted so far :-
var partyNameOccurrenceMap = validParties.parallelStream()
.map(Party::getName)
.collect(Collectors.groupingBy(Function.identity(), HashMap::new, Collectors.counting()));
var partyNameOccurrenceMapCopy = SerializationUtils.clone(partyNameOccurrenceMap);
var duplicateParties = validParties.stream()
.filter(party-> {
var occurrence = partyNameOccurrenceMap.get(party.getName());
if (occurrence > 1) {
partyNameOccurrenceMap.put(party.getName(), occurrence - 1);
return true;
}
return false;
})
.toList();
var nonDuplicateParties = validParties.stream()
.filter(party -> {
var occurrence = partyNameOccurrenceMapCopy.get(party.getName());
if (occurrence > 1) {
partyNameOccurrenceMapCopy.put(party.getName(), occurrence - 1);
return false;
}
return true;
})
.toList();
The above code only checks for party name but we also need to check for email, phone number and gst number.
The code written above works just fine but the readability, conciseness and the performance might be an issue as the data set is large enough like 10k rows in excel file
CodePudding user response:
Never ignore Equals/hashCode contract
name
,number
,gstNumber
Any of these properties can result in a duplicate, which mean or
Your definition of a duplicate implies that any of these properties should match, whilst others might not.
It means that it's impossible to provide an implementation equals/hashCode
that would match the given definition and doesn't violate the hashCode contract.
If two objects are equal according to the
equals
method, then calling thehashCode
method on each of the two objects must produce the same integer result.
I.e. if you implement equals
in such a way they any (not all) of these properties: name
, email
, number
, gstNumber
could match, and that would enough to consider the two objects equal, then there's no way to implement hashCode
correctly.
And as the consequence of this, you can't use the object with a broken equals/hashCode
implementation in with a hash-based Collection because equal objects might end up the in the different bucket (since they can produce different hashes). I.e. HashMap
would not be able to recognize the duplicated keys, hence groupingBy
with groupingBy()
with Function.identity()
as a classifier function would not work properly.
Therefore, to address this problem, you need to implement equals()
based on all 4
properties: name
, email
, number
, gstNumber
(i.e. all these values have to be equal), and similarly all these values must contribute to hash-code.
How to determine Duplicates
There's no easy way to determine duplicates by multiple criteria. The solution you've provided is not viable, since we can't rely on the equals/hashCode
.
The only way is to generate a HashMap
separately for each end every attribute (i.e. in this case we need 4
maps). But can we alternate this, avoiding repeating the same steps for each map and hard coding the logic?
Yes, we can.
We can create a custom generic accumulation type (it would be suitable for any class - no hard-coded logic) that would encapsulate all the logic of determining duplicates and maintain an arbitrary number of maps under the hood. After consuming all the elements from the given collection, this custom object would be aware of all the duplicates in it.
That's how it can be implemented.
A custom accumulation type that would be used as container of a custom Collector. Its constructor expects varargs of functions, each function correspond to the property that should be taken into account while checking whether an object is a duplicate.
public static class DuplicateChecker<T> implements Consumer<T> {
private List<DuplicateHandler<T>> handles;
private Set<T> duplicates;
@SafeVarargs
public DuplicateChecker(Function<T, ?>... keyExtractors) {
this.handles = Arrays.stream(keyExtractors)
.map(DuplicateHandler::new)
.toList();
}
@Override
public void accept(T t) {
handles.forEach(h -> h.accept(t));
}
public DuplicateChecker<T> merge(DuplicateChecker<T> other) {
for (DuplicateHandler<T> handler: handles) {
other.handles.forEach(handler::merge);
}
return this;
}
public DuplicateChecker<T> finish() {
duplicates = handles.stream()
.flatMap(handler -> handler.getDuplicates().stream())
.flatMap(Set::stream)
.collect(Collectors.toSet());
return this;
}
public boolean isDuplicate(T t) {
return duplicates.contains(t);
}
}
A helper class representing a single createrion (like name
, email
, etc.) which encapsulates a HashMap
. keyExtractor
is used to obtain a key from an object of type T
.
public static class DuplicateHandler<T> implements Consumer<T> {
private Map<Object, Set<T>> itemByKey = new HashMap<>();
private Function<T, ?> keyExtractor;
public DuplicateHandler(Function<T, ?> keyExtractor) {
this.keyExtractor = keyExtractor;
}
@Override
public void accept(T t) {
itemByKey.computeIfAbsent(keyExtractor.apply(t), k -> new HashSet<>()).add(t);
}
public void merge(DuplicateHandler<T> other) {
other.itemByKey.forEach((k, v) ->
itemByKey.merge(k,v,(oldV, newV) -> { oldV.addAll(newV); return oldV; }));
}
public Collection<Set<T>> getDuplicates() {
Collection<Set<T>> duplicates = itemByKey.values();
duplicates.removeIf(set -> set.size() == 1); // the object is proved to be unique by this particular property
return duplicates;
}
}
And that is the method, responsible for generating the map of duplicates, that would be used from the clean code. The given collection would be partitioned into two parts: one mapped to the key true
- duplicates, another mapped to the key false
- unique objects.
public static <T> Map<Boolean, List<T>> getPartitionByProperties(Collection<T> parties,
Function<T, ?>... keyExtractors) {
DuplicateChecker<T> duplicateChecker = parties.stream()
.collect(Collector.of(
() -> new DuplicateChecker<>(keyExtractors),
DuplicateChecker::accept,
DuplicateChecker::merge,
DuplicateChecker::finish
));
return parties.stream()
.collect(Collectors.partitioningBy(duplicateChecker::isDuplicate));
}
And that how you can apply it for your particular case.
main()
public static void main(String[] args) {
List<Party> parties = // initializing the list of parties
Map<Boolean, List<Party>> isDuplicate = partitionByProperties(parties,
Party::getName, Party::getNumber,
Party::getEmail, Party::getGstNumber);
}
CodePudding user response:
I would use create a map for each property where
key
is the property we want to check duplicatevalue
is aSet
containing all the index of element in the list with same key.
Then we can
- filter values in the map with more that 1 index (i.e. duplicate indexes).
- union all the duplicate index
- determine if the element is duplicate/unique by using the duplicate index.
The time complexity is roughly O(n).
public class UniquePerEachProperty {
private static void separate(List<Party> partyList) {
Map<String, Set<Integer>> nameToIndexesMap = new HashMap<>();
Map<String, Set<Integer>> emailToIndexesMap = new HashMap<>();
Map<Long, Set<Integer>> numberToIndexesMap = new HashMap<>();
Map<String, Set<Integer>> gstNumberToIndexesMap = new HashMap<>();
for (int i = 0; i < partyList.size(); i ) {
Party party = partyList.get(i);
nameToIndexesMap.putIfAbsent(party.getName(), new HashSet<>());
nameToIndexesMap.get(party.getName()).add(i);
emailToIndexesMap.putIfAbsent(party.getEmail(), new HashSet<>());
emailToIndexesMap.get(party.getEmail()).add(i);
numberToIndexesMap.putIfAbsent(party.getNumber(), new HashSet<>());
numberToIndexesMap.get(party.getNumber()).add(i);
gstNumberToIndexesMap.putIfAbsent(party.getGstNumber(), new HashSet<>());
gstNumberToIndexesMap.get(party.getGstNumber()).add(i);
}
Set<Integer> duplicatedIndexes = Stream.of(
nameToIndexesMap.values(),
emailToIndexesMap.values(),
numberToIndexesMap.values(),
gstNumberToIndexesMap.values()
).flatMap(Collection::stream).filter(indexes -> indexes.size() > 1)
.flatMap(Set::stream).collect(Collectors.toSet());
List<Party> duplicatedList = new ArrayList<>();
List<Party> uniqueList = new ArrayList<>();
for (int i = 0; i < partyList.size(); i ) {
Party party = partyList.get(i);
if (duplicatedIndexes.contains(i)) {
duplicatedList.add(party);
} else {
uniqueList.add(party);
}
}
System.out.println("duplicated:" duplicatedList);
System.out.println("unique:" uniqueList);
}
public static void main(String[] args) {
separate(List.of(
// name duplicate
new Party("name1", 1L, "email1", "gstNumber1"),
new Party("name1", 2L, "email2", "gstNumber2"),
// number duplicate
new Party("name3", 3L, "email3", "gstNumber3"),
new Party("name4", 3L, "email4", "gstNumber4"),
// email duplicate
new Party("name5", 5L, "email5", "gstNumber5"),
new Party("name6", 6L, "email5", "gstNumber6"),
// gstNumber duplicate
new Party("name7", 7L, "email7", "gstNumber7"),
new Party("name8", 8L, "email8", "gstNumber7"),
// unique
new Party("name9", 9L, "email9", "gstNumber9")
));
}
}
Assume Party has below constructor
and toString()
(for testing)
public class Party {
public Party(String name, Long number, String email, String gstNumber) {
this.name = name;
this.number = number;
this.email = email;
this.address = "";
this.openingBalance = BigDecimal.ZERO;
this.openingDate = LocalDateTime.MIN;
this.gstNumber = gstNumber;
}
@Override
public String toString() {
return "Party{"
"name='" name '\''
", number=" number
", email='" email '\''
", gstNumber='" gstNumber '\''
'}';
}
...
}