How to group by subitems of a list in a List?

Time:10-31

I am new to Scala. I am curious about how to group sub-items of a list.

For example, I have data on categories of novels and their ratings. I want to get total ratings in a category.

Format of data: (Option[List(categories)], Option[rating])

var a = List(List(List("romance", "thriller"), 4), List(List("adventure", "thriller"), 3))

I want to get a mapping of Key(Category) => Value(Their ratings)

romance => (4)
thriller => (4,3)
adventure => (3)

I tried a.groupBy(_._1), but it only groups entries that have exactly the same list of categories. I searched other posts but couldn't find any similar questions.

CodePudding user response:

You first want to transform your data to a List[(Category, Rating)], and then you can use groupMap (available since Scala 2.13) to group by category.

val a = List(
  (List("romance", "thriller"), 4),
  (List("adventure", "thriller"), 3)
)

a.flatMap{ 
  case (cs, r) => cs.map(_ -> r)
}
.groupMap(_._1)(_._2)
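
To make the two steps visible, here is a minimal runnable sketch of the same idea. The object name GroupMapDemo and the intermediate names pairs and byCategory are only for illustration, and it assumes Scala 2.13 or later for groupMap.

```scala
object GroupMapDemo {
  // Same sample data as above: (categories, rating) pairs.
  val a: List[(List[String], Int)] = List(
    (List("romance", "thriller"), 4),
    (List("adventure", "thriller"), 3)
  )

  // Step 1: emit one (category, rating) pair per category of each novel.
  val pairs: List[(String, Int)] =
    a.flatMap { case (cs, r) => cs.map(_ -> r) }

  // Step 2: group by category and keep only the rating of each pair.
  val byCategory: Map[String, List[Int]] =
    pairs.groupMap(_._1)(_._2)

  def main(args: Array[String]): Unit =
    byCategory.foreach { case (category, rs) =>
      println(s"$category => ${rs.mkString("(", ",", ")")}")
    }
}
```

Note that a Map has no guaranteed iteration order, so the categories may print in any order.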

CodePudding user response:

This data structure is extremely hard to work with, and that might be a cause of your problems. You have a list, which in turn contains lists, each of which in turn contains two elements, the first of which is another list, and the second element is a number.

This has many, many problems. For starters, almost none of the lists actually are lists.

The innermost list is not a list, since the order of the elements doesn't matter. It should probably be a set.

The middle list is very problematic: since it contains both a list and a number, it cannot have a useful type. The best we can do is List[Any] in Scala 2 or List[List[String] | Int] in Scala 3. The actual type that is inferred in Scala 3 is List[Matchable], which is essentially useless. It "pollutes" everything you try to do later. It should probably be a pair, i.e. a (Set[String], Int).

And the outer list doesn't actually need to be a list either, it can be any sequence; we don't particularly care what specific kind of sequence it is. All we care about is that we can iterate over it.

So, let's fix that data structure first:

val ratings = Seq(
  Set("romance", "thriller")   -> 4,
  Set("adventure", "thriller") -> 3
)

This data structure is both much easier to work with and more closely resembles the semantics of what you are trying to model.

In order to get the individual ratings for the genres, we need to flatten this representation out, so we get something like

Seq(
  "romance"   -> 4,
  "thriller"  -> 4,
  "adventure" -> 3,
  "thriller"  -> 3
)

We can do this by flatMapping over ratings and mapping each genre to a pair of (genre, rating):

val individualRatings = for {
  (set, rating) <- ratings
  genre <- set
} yield genre -> rating
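
If the for comprehension hides too much, the same flattening can be written with explicit flatMap and map calls. This is a sketch of roughly what the comprehension expands to, not the exact compiler output; the object name FlattenDemo is just for illustration.

```scala
object FlattenDemo {
  val ratings: Seq[(Set[String], Int)] = Seq(
    Set("romance", "thriller")   -> 4,
    Set("adventure", "thriller") -> 3
  )

  // For every (set, rating) entry, pair each genre in the set with the rating,
  // then concatenate the results of all entries into one sequence.
  val individualRatings: Seq[(String, Int)] =
    ratings.flatMap { case (set, rating) =>
      set.map(genre => genre -> rating)
    }
}
```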

Now, we can groupBy the genre, which will give us something like

Map(
  "romance"   -> Seq("romance" -> 4),
  "thriller"  -> Seq("thriller" -> 4, "thriller" -> 3),
  "adventure" -> Seq("adventure" -> 3)
)

This means we need to map the values, keeping only the second element of each pair, so we end up with something like this:

Map(
  "romance"   -> Seq(4),
  "thriller"  -> Seq(4, 3),
  "adventure" -> Seq(3)
)

And lastly, we need to reduce the list of ratings to a single rating.
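
Spelled out one step at a time, the three operations might look like the sketch below. It uses plain map on the Map instead of mapValues, because Map.mapValues is deprecated since Scala 2.13 in favour of .view.mapValues; the object and value names are only for illustration.

```scala
object GroupByStepsDemo {
  val individualRatings: Seq[(String, Int)] = Seq(
    "romance"   -> 4,
    "thriller"  -> 4,
    "adventure" -> 3,
    "thriller"  -> 3
  )

  // Step 1: groupBy keeps the whole (genre, rating) pair inside each group.
  val grouped: Map[String, Seq[(String, Int)]] =
    individualRatings.groupBy(_._1)

  // Step 2: drop the genre from each pair, leaving only the ratings.
  val ratingsByGenre: Map[String, Seq[Int]] =
    grouped.map { case (genre, pairs) => genre -> pairs.map(_._2) }

  // Step 3: reduce each group's ratings to a single total.
  val totals: Map[String, Int] =
    ratingsByGenre.map { case (genre, rs) => genre -> rs.sum }
}
```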

There is actually a helpful method, groupMapReduce (available on collections since Scala 2.13), which combines these three operations in an efficient manner:

val totals = individualRatings.groupMapReduce(_._1)(_._2)(_ + _)

And the end result is this:

Map(
  "adventure" -> 3,
  "romance"   -> 4,
  "thriller"  -> 7
)
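
Putting the pieces of this answer together, a self-contained version could look like the following sketch. The object name GenreTotals is hypothetical, and the reduce function is assumed to be addition, which matches the totals shown above.

```scala
object GenreTotals {
  val ratings: Seq[(Set[String], Int)] = Seq(
    Set("romance", "thriller")   -> 4,
    Set("adventure", "thriller") -> 3
  )

  // Flatten to one (genre, rating) pair per genre.
  val individualRatings: Seq[(String, Int)] = for {
    (set, rating) <- ratings
    genre         <- set
  } yield genre -> rating

  // Group by genre, keep only the rating, and sum the ratings per genre.
  val totals: Map[String, Int] =
    individualRatings.groupMapReduce(_._1)(_._2)(_ + _)

  def main(args: Array[String]): Unit =
    // Sort by genre for stable output, since Map iteration order is unspecified.
    totals.toSeq.sortBy(_._1).foreach { case (genre, total) =>
      println(s"$genre -> $total")
    }
}
```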

However, this is not necessarily the best we can do. Scala is an object-oriented programming language, after all, not a list-of-lists-of-lists-of-strings-or-numbers-oriented language.

What we have done is improve the data structure from a list-of-lists-of-lists-of-strings-or-numbers to a sequence-of-pairs-of-sets-of-strings-and-numbers. What we would really like to have is a sequence-of-ratings.

I will leave that as an exercise for the reader for now.
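
As a hint, one possible shape for such a type is sketched below. The case class Rating and its field names genres and score are assumptions made up for illustration; they are not part of the question.

```scala
// Hypothetical domain type: the name and fields are illustrative assumptions.
final case class Rating(genres: Set[String], score: Int)

object RatingDemo {
  val ratings: Seq[Rating] = Seq(
    Rating(Set("romance", "thriller"), 4),
    Rating(Set("adventure", "thriller"), 3)
  )

  // One (genre, score) pair per genre of each rating, then sum per genre.
  val totals: Map[String, Int] =
    ratings
      .flatMap(r => r.genres.map(_ -> r.score))
      .groupMapReduce(_._1)(_._2)(_ + _)
}
```

With named fields, the code reads r.genres and r.score instead of positional _1 and _2 on the input side.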

Note: even if you have no control over the data coming in, it still makes sense to "fix" the data as soon as it enters the system and only ever work with the fixed data, instead of having to deal with the broken data everywhere.

You could, for example, create the data structure I outlined above from your data using something like this:

val ratings = for {
  (genres: List[String]) :: (rating: Integer) :: Nil <- a
} yield genres.toSet -> rating

Note that because of type erasure, this is actually not type-safe: at runtime, the element type of the List is erased, so the type test in the pattern can only check that the first element is a List, not that it is a List[String]. The compiler will typically flag that test as unchecked. Statically, both elements are still just Matchables; the pattern is the only thing that narrows them.
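
If you want the conversion to check the elements as well, one option is to test them explicitly at runtime. The sketch below is one way to do that, assuming that silently skipping malformed rows is acceptable; the object name SafeConversionDemo and the extra malformed row are made up for illustration.

```scala
object SafeConversionDemo {
  // The question's untyped input, plus one malformed row for demonstration.
  val a: List[List[Any]] = List(
    List(List("romance", "thriller"), 4),
    List(List("adventure", "thriller"), 3),
    List(List("horror", 1), 5) // genre list contains a non-String element
  )

  // collect keeps only the rows that match the pattern and the guard.
  // The guard checks every element, which the erased List[String] test cannot do.
  val ratings: Seq[(Set[String], Int)] = a.collect {
    case (genres: List[_]) :: (rating: Int) :: Nil
        if genres.forall(_.isInstanceOf[String]) =>
      genres.collect { case g: String => g }.toSet -> rating
  }
}
```

Rows that do not match are dropped rather than reported. If bad input should be an error instead, a plain map with a final case that throws would be an alternative.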
