Fastest way to join 2 collections in Java based on a key-CodePudding

I'm looking for the fastest way to merge two unsorted collections based on a common id key.

Below O(N^2) implementation

    for (Person per : pers) {
      for (Data data : datas) {
          if (per.getId().equals(data.getId())) {
              per.getData().add(data);
          }
        } 
      }

I'm looking for the fastest possible way (and lowest memory footprint possible) to achieve this result, possibly O(N). Duplicates should be removed from per.getData(). For now, per.getData() is a HashSet

Any idea how this could be optimized ? I'm using java 11

CodePudding user response：

Do one pass over persons to collect into a map for later O(1) lookup, then do one pass over data adding it to person:

Map<Object, Person> people = pers.stream()
  .collect(Collectors.toMap(Person::getId, p -> p));
datas.forEach(d -> people.get(d.getId()).add(d));

If it’s possible for a data to have a matching person, filter out unmatched data:

datas.stream()
  .filter(d -> people.containsKey(d.getId()))
  .forEach(d -> people.get(d.getId()).add(d));

Both ways are O(m n) (m people, n datas), because all map operations are O(1).

You mentioned that duplicates should be removed from person’s data. Being a HashSet (or any kind of Set), duplicates are automatically removed if equals() and hashCode() are coded properly for Data.

CodePudding user response：

Here's a linear approach (O(n)) that is better than O(n^2) but will use memory.

Create a HashMap<personId, personObject> then loop on persons and insert them into the map.
Loop on Datas and check if the dataId is present in the HashMap. If it exists, get the personObject and add the dataObject to its HashSet.

HashMap<Integer, Person> mp = new HashMap<>();
for (Person per : pers) {
   mp.put(per.getId(), per);
}
for (Data data : datas) {
   if (mp.get(data.getId()) != null) {
      Person person = mp.get(data.getId());
      person.getData().add(data);
      mp.put(person.getId(), person);
   }
}

Please note that I am assuming that you're using Integers as Ids. You can change the code to suit your case.