Using parallel streams for finding similar strings in an array?-CodePudding

Let's say I have a large array of unique strings and I want to find all pairs which are at least 50% similar.

A straightforward implementation:

final String[] strings = {"abc", "dsfdsf", "foo", "baaa", ...};

final Collection<Pair<String, String>> matches = new ArrayList<>();

for (final String s1 : strings) {
    for (final String s2 : strings) {
        if (calculateSimilarity(s1, s2) >= 0.5) {
            matches.add(new Pair(s1, s2));
        }
    }
}

Now, let's make it multithreaded by using parallel stream:

for (final String s1 : strings) {
    Arrays.stream(strings).parallel()
        .filter(s2 -> calculateSimilarity(s1, s2) >= 0.5)
        .collect(toList())
        .stream()
        .forEach(s2 -> matches.add(new Pair(s1, s2)));
}

Each subsequent s1 string is compared (in parallel) with all s2 strings. All strings matching s1 are collected into a single list, and then they are added sequentially to matches (because ArrayList is not thread-safe).

This already works much faster than the sequential version. However, I want to introduce an improvement: avoid comparing the same two strings twice, because always calculateSimilarity("aaa", "bbb") == calculateSimilarity("bbb", "aaa"). I would also like to avoid comparing each string against itself.

So, back to the original algorithm:

for (int i = 0; i < strings.length; i  ) {
    for (int j = i   1; j < strings.length; j  ) {  // <--- sic! NOT int j = 0
            if (calculateSimilarity(strings[i], strings[j]) >= 0.5) {
                matches.add(new Pair(strings[i], strings[j]));
            }
        }
    }
}

Now, my question is: how to introduce this improvement to the parallel stream version?

Should I use .skip() somehow?

CodePudding user response：

if you use an indexed for loop you can look at the indexes above current index only to reduce number of comparisons and skip itself

for (int i = 0; i < strings.length; i  ) {
  String s1 = strings[i];
  for (int j = i   1; j < strings.length; j  ) { //<--- inner loop only looks at "new comparisons" due to i 1
    String s2 = strings[j];
    if (calculateSimilarity(s1, s2)) {
        matches.add(new Pair(s1, s2));
    }
}

CodePudding user response：

You can take advantage of knowing that similarity is idempotent, so you only need to compare from the string next position in array onwards.

Also, if possible, you can take advantage of stream parallel execution to process large arrays.

The following is an example of how you can achieve it.

final double ACCEPTED_SIMILARITY_INDEX = 0.75;
String[] strings = {"A", "B", "C", "AA", "BB"};

final Set<Set<ImmutablePair<String, String>>> collect = IntStream
    .range(0, strings.length)
    .parallel()
    .mapToObj(index -> new ImmutablePair<>(strings[index], index   1))
    .map(immutablePair -> Arrays.stream(strings, immutablePair.right, strings.length).parallel()
        .filter(stringToCompare -> new JaccardDistance().apply(stringToCompare, immutablePair.left) > ACCEPTED_SIMILARITY_INDEX)
        .map(similarString -> new ImmutablePair<>(immutablePair.left, similarString)).collect(Collectors.toSet())).filter(workingSet -> !workingSet.isEmpty())
    .collect(Collectors.toSet());

CodePudding user response：

My solution is based on the approach of this answer, but it uses Java 8 streams and adds some parallelism.

int len = strings.length;
List<Pair<String, String>> result = 
      LongStream.range(0, (long)(len) * len))
                .parallel()
                .filter(l -> (l / len > l % len) && 
                             calculateSimilarity(strings[l / len], 
                                                 strings[l % len]) > 0.5)
                .map(l -> new Pair<>(strings[l / len], strings[l % len])
                .collect(toList());

The approach I use is to iterate the positions of an imaginary len x len matrix flattened into 1-D. The (l / len) and (l % len) map the 1-D coordinates to 2-D, and then the (l / len > l % len) test checks that we are above the diagonal of the 2-D matrix.

I try to avoid creating any intermediate structures (e.g. temporary arrays or a de-duping HashSet) and any Pair objects that would be discarded.

Notes:

If we could constrain the length of strings to be less than 2^16, we could use an IntStream and int calculations.
If there are duplicates in the input strings array, there will be duplicates in the list of Pair objects.
This is still going to be O(N^2) where N is strings.length.