Home > OS >  GroupBy with list of keys -- Scala
GroupBy with list of keys -- Scala

Time:10-20

I'm micro-optimising some code as a challenge.

I have a list of objects with a list of keys in each of them. What's the most efficient way of grouping them by key, with each object being in every group of which it has a key.

This is what I have, but I have a feeling it can be improved. I have many objects (100k ), each has ~2 keys, and there's less than 50 possible keys. I've tried parallelising the list with listOfObjs.par, but there doesn't seem to be much of an improvement overall.

case class Obj(value: Option[Int], key: Option[List[String]])

listOfObjs
  .filter(x => x.key.isDefined && x.value.isDefined)
  .flatMap(x => x.key.get.map((_, x.value.get)))
  .groupBy(_._1)

CodePudding user response:

If you have that many object, the logical next step would be to distribute the work by using a MapReduce framework. At the end of the day you still need to go over every single object to determine the group it belongs in and your worst case will be bottlenecked by that.

The best you can do here is to replace these 3 operations by a fold so you only iterate through the collection once.

Edit: Updated the order based on Luis' recommendation in the comments

listOfObj.foldLeft(Map.empty[String, List[Int]]){ (acc, obj) =>
  (obj.key, obj.value) match { 
    case (Some(k), Some(v)) => 
      k.foldLeft(acc)((a, ky) => a   (ky -> {v  : a.getOrElse(ky, List.empty)}))))
    case _ => acc
  }
}

CodePudding user response:

I got the impression you are looking for a fast alternative; thus a little bit of encapsulated mutability can help.
So, what about something like this:

def groupObjectsByKey(objects: List[Obj]): Map[String, List[Int]] = {
  val iter =
    objects.iterator.flatMap {
      case Obj(Some(value), Some(keys)) =>
        keys.iterator.map(key => key -> value)
      
      case _ =>
        Iterator.empty[(String, Int)]
    }

  val m =
    mutable
      .Map
      .empty[String, mutable.Builder[Int, List[Int]]
  
  iter.foreach {
    case (k, v) =>
      m.get(key = k) match {
        case Some(builder) =>
          builder.addOne(v)
          
        case None =>
          m.update(key = k, value = List.newBuilder[Int].addOne(v))
      }
  }
  
  immutable
    .Map
    .mapFactory[String, List[Int]]
    .fromSpecific(m.view.mapValues(_.result()))
}

Or if you don't care about the order of the elements of each group we can simplify and speed up the code a lot:

def groupObjectsByKey(objects: List[Obj]): Map[String, List[Int]] = {
  val iter = objects.iterator.flatMap {
    case Obj(Some(value), Some(keys)) =>
      keys.iterator.map(key => key -> value)
    
    case _ =>
      Iterator.empty[(String, Int)]
  }

  val m = mutable.Map.empty[String, List[Int]]
  
  iter.foreach {
    case (k, v) =>
      m.updateWith(key = k) match {
        case Some(list) =>
          Some(v :: list)
          
        case None =>
          Some(v :: Nil)
      }
  }

  m.to(immutable.Map)
}
  • Related