Check for duplicated values in sortedMap

I have a map initialized as

val cache: SortedMap<String, String> = sortedMapOf()

The map is used as a cache, and the same value can appear under multiple distinct keys. I want to count how many duplicate values there are in the cache. Note that the cache can contain millions of entries.

As of now, I check for duplicates this way:

val uniqueValueSet = hashSetOf<String>()
val numDuplicates = cache.filter { !uniqueValueSet.add(it.value) }.count()

However, this check feels memory-inefficient: adding all the distinct values to a set builds a throwaway set with possibly millions of elements.

Is there a better, and more optimized, way of checking duplicates between the values in a map?

CodePudding user response:

If you're interested only in the number of duplicates, you can simply do:

val numOfDuplicates = cache.size - cache.values.toHashSet().size

It will still create a set of all distinct values, but that set is the only overhead.
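
If even the intermediate Map that filter produces in the question's snippet is a concern, a minimal single-pass sketch of the same idea (reusing the cache from the question; the set of seen values is the only extra allocation) would be:

val seen = hashSetOf<String>()
var numDuplicates = 0
for (value in cache.values) {
    // add() returns false for every occurrence after the first
    if (!seen.add(value)) numDuplicates++
}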

Another option is to trade space for time: space complexity drops from O(N) to O(M), where N is the size of the cache and M is the number of distinct duplicated values (which makes sense when M << N), while time complexity grows from O(N*logN) to O(N^2):

var numOfDuplicates = 0
val duplicates = hashSetOf<String>()
for (value in cache.values) {
    if (value in duplicates) {
        numOfDuplicates++
    } else if (cache.values.atLeastTwo { it == value }) {
        duplicates.add(value)
    }
}

// Returns true as soon as a second element satisfying the predicate is found.
public inline fun <T> Iterable<T>.atLeastTwo(predicate: (T) -> Boolean): Boolean {
    var atLeastOne = false
    for (it in this) {
        if (predicate(it)) {
            if (!atLeastOne) {
                atLeastOne = true
            } else {
                return true
            }
        }
    }
    return false
}
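
As a quick sanity check of atLeastTwo (the values here are made up purely for illustration):

val values = listOf("x", "y", "x")
println(values.atLeastTwo { it == "x" }) // true: "x" occurs twice
println(values.atLeastTwo { it == "y" }) // false: "y" occurs only once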

CodePudding user response:

This should be the fastest (inspired by Михаил Нафталь's solution):

val numDuplicates = cache.size - cache.values.distinct().size

Edit 1: I did some tests, and up to about 50,000 entries (keys and values each being random strings of 10 alphanumeric characters) this is indeed faster. Above approximately 50,000 entries, Михаил Нафталь's solution becomes drastically faster:

val numOfDuplicates = cache.size - cache.values.toHashSet().size
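
A plausible explanation (an assumption based on the stdlib sources, not something the tests above verify): distinct() buffers the values into an order-preserving LinkedHashSet and then copies them into a List, so at scale it does strictly more work than toHashSet(). Roughly:

// Approximately what cache.values.distinct() does internally
// (assumption about stdlib internals, for illustration only)
val distinctValues = LinkedHashSet(cache.values).toList()
val numDuplicates = cache.size - distinctValues.size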

Edit 2: Added some simple tests:

val countOfEntries = 1_000_000
val lengthOfRandomKey = 10
val lengthOfRandomValue = 10

fun randomString(length: Int): String {
  val charPool: List<Char> = ('a'..'z') + ('A'..'Z') + ('0'..'9')
  return (1..length).map { _ -> kotlin.random.Random.nextInt(0, charPool.size) }.map(charPool::get)
    .joinToString("")
}

val cache: java.util.SortedMap<String, String> = sortedMapOf()
repeat(countOfEntries) { cache[randomString(lengthOfRandomKey)] = randomString(lengthOfRandomValue) }

val uniqueValueSet = hashSetOf<String>()
val start1 = System.nanoTime()
val numDuplicates1 = cache.filter { !uniqueValueSet.add(it.value) }.count()
val end1 = System.nanoTime()
println("$numDuplicates1 duplicates, time ${(end1 - start1).toDouble() / 1_000_000_000} s")

val start2 = System.nanoTime()
val numDuplicates2 = cache.size - cache.values.toHashSet().size
val end2 = System.nanoTime()
println("$numDuplicates2 duplicates, time ${(end2 - start2).toDouble() / 1_000_000_000} s")

val start3 = System.nanoTime()
val numDuplicates3 = cache.size - cache.values.distinct().size
val end3 = System.nanoTime()
println("$numDuplicates3 duplicates, time ${(end3 - start3).toDouble() / 1_000_000_000} s")