How to remove Keys that would cause Collisions before executing Collectors.toMap()


I have a stream of objects similar to the one in this previous question; however, instead of ignoring duplicate values, I would like to remove any entries with duplicate keys from the stream beforehand and print them out.

For example, from this snippet:

Map<String, String> phoneBook = people.stream()
                                      .collect(toMap(Person::getName,
                                                     Person::getAddress));

If there were duplicate keys, a java.lang.IllegalStateException: Duplicate key would be thrown.
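For reference, assume a minimal Person class along these lines (the actual class isn't shown here; only the getName and getAddress accessors matter):

// Assumed minimal Person: an immutable name/address pair,
// consistent with the Person::getName and Person::getAddress
// references used throughout.
public class Person {
    private final String name;
    private final String address;

    public Person(String name, String address) {
        this.name = name;
        this.address = address;
    }

    public String getName() { return name; }

    public String getAddress() { return address; }
}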

The solution proposed in that question used a mergeFunction to keep the first entry if a collision was found.

Map<String, String> phoneBook = 
    people.stream()
          .collect(Collectors.toMap(
             Person::getName,
             Person::getAddress,
             (address1, address2) -> {
                 System.out.println("duplicate key found!");
                 return address1;
             }
          ));

Instead of keeping the first entry when there is a collision from a duplicate key in the stream, I want to know which key caused the collision and make sure that there are no occurrences of that key within the resulting map.

I.e. if "Bob" appeared three times in the stream, it should not be in the map even once.

In the process of creating that map, I would like to filter out any duplicate names and record them some way.

I want to make sure that when creating the map there can be no duplicate entries, and that there is some way to know which entries had duplicate keys in the incoming stream. I was thinking about using groupingBy and filter beforehand to find the duplicate keys, but I am not sure what the best way to do it is.

CodePudding user response:

You can do it in a single stream statement by utilizing Collectors.teeing() and a custom object that contains separate collections of the duplicated and unique entries of the phone book.

Since the primary function of this object is only to carry data, I've implemented it as a Java 16 record.

public record FilteredPhoneBook(Map<String, String> uniquePersonsAddressByName,
                                List<String> duplicatedNames) {}

Collectors.teeing() expects three arguments: two downstream collectors and a function that merges the results produced by both.

The map generated by groupingBy() in conjunction with counting() is meant to determine the duplicated names.

As @JimGarrison has pointed out, pre-processing the data doesn't make sense, so toMap, used as the second collector, will create a map containing all names (keeping the first address when a key collides).

When both collectors hand their results to the merger function, it takes care of removing the duplicates.

main() - demo

public static void main(String[] args) {
    List<Person> people = List.of(
        new Person("Alise", "address1"),
        new Person("Bob", "address2"),
        new Person("Bob", "address3"),
        new Person("Carol", "address4"),
        new Person("Bob", "address5")
    );

    FilteredPhoneBook filteredPhoneBook =
        people.stream()
            .collect(Collectors.teeing(
                Collectors.groupingBy(Person::getName, Collectors.counting()),
                Collectors.toMap(
                    Person::getName,
                    Person::getAddress,
                    (left, right) -> left),
                (Map<String, Long> countByName, Map<String, String> addressByName) -> {
                    List<String> duplicates = countByName.entrySet().stream()
                        .filter(entry -> entry.getValue() > 1)
                        .map(Map.Entry::getKey)
                        .toList();
                    
                    addressByName.keySet().removeAll(new HashSet<>(duplicates));
                    return new FilteredPhoneBook(addressByName, duplicates);
                }
            ));
        
    System.out.println("Unique entries:");
    filteredPhoneBook.uniquePersonsAddressByName().forEach((k, v) -> System.out.println(k + " : " + v));
    System.out.println("\nDuplicates:");
    filteredPhoneBook.duplicatedNames().forEach(System.out::println);
}

Output

Unique entries:
Alise : address1
Carol : address4

Duplicates:
Bob

CodePudding user response:

You can't know which keys are duplicates until you have processed the entire input stream. Therefore, any pre-processing step has to make a complete pass over the input before your main logic runs, which is wasteful.

An alternate approach could be:

  1. Use the merge function to insert a dummy value for the offending key
  2. At the same time, insert the offending key into a Set<K>
  3. After the input stream is processed, iterate over the Set<K> to remove the offending keys from the primary map (a sketch follows below).
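
A minimal sketch of that approach, assuming the Person class from the question and the usual java.util imports. One caveat: the merge function passed to toMap only receives the two colliding values, not the key, so this sketch records the offending keys during the post-processing pass (step 3) rather than inside the merge function itself:

final String DUPLICATE = "<duplicate>"; // dummy sentinel; assumes no real address equals it

// Step 1: on a key collision, store the dummy value instead of an address.
Map<String, String> phoneBook = people.stream()
        .collect(Collectors.toMap(
                Person::getName,
                Person::getAddress,
                (first, second) -> DUPLICATE));

// Steps 2 and 3: record each offending key, then purge its entry from the map.
Set<String> duplicateNames = new HashSet<>();
phoneBook.entrySet().removeIf(entry -> {
    if (DUPLICATE.equals(entry.getValue())) {
        duplicateNames.add(entry.getKey());
        return true; // drop the dummy entry
    }
    return false;
});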

CodePudding user response:

If I have understood your clarifications in the comments correctly, people who appear more than once in your list should not be contained in the final map.

if "Bob" appeared three times in the stream, it should not be in the map even once.

And each person who appears more than once should be stored in a List of duplicated people.

I would like to filter out any duplicate names and record them some way.

This should be what you were looking for.

List<String> peopleDuplicated = new ArrayList<>();
Map<String, String> phoneBook2 = people.stream()
        .collect(Collectors.groupingBy(Person::getName)) //grouping people by name
        .entrySet().stream() //creating a stream from the entries of the grouped map
        .peek(e -> {
            //Adding to the duplicated names list the name of every person with more than one address
            if (e.getValue().size() > 1){
                peopleDuplicated.add(e.getKey());
            }
        })
        .filter(e -> e.getValue().size() == 1) //keeping only the people which have occurred only once (no duplicates)
        .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().get(0).getAddress())); //mapping the remaining entries into a new map

Warning

Since the duplicate elements are stored via a stateful lambda, this solution should be used only with non-parallel streams, as its outcome could be unpredictable in a parallel execution.

CodePudding user response:

In mathematical terms, you want to partition your grouped aggregate and handle both parts separately.

Map<String, String> makePhoneBook(Collection<Person> people) {
    Map<Boolean, List<Person>> phoneBook = people.stream()
            .collect(Collectors.groupingBy(Person::getName))
            .values()
            .stream()
            .collect(Collectors.partitioningBy(list -> list.size() > 1,
                    Collectors.mapping(r -> r.get(0),
                            Collectors.toList())));

    // handle duplicates
    phoneBook.get(true)
            .forEach(x -> System.out.println("duplicate found " + x));

    return phoneBook.get(false).stream()
            .collect(Collectors.toMap(
                    Person::getName,
                    Person::getAddress));
}
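
For example (hypothetical usage, reusing the demo list from the first answer):

List<Person> people = List.of(
        new Person("Alise", "address1"),
        new Person("Bob", "address2"),
        new Person("Bob", "address3"),
        new Person("Carol", "address4"),
        new Person("Bob", "address5"));

// Prints one "duplicate found ..." line (for Bob's first entry),
// then returns a map containing only Alise and Carol.
Map<String, String> phoneBook = makePhoneBook(people);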