Scala - Ids lists of objects with duplicated values from spark dataset

Time:08-06

I need to create lists of IDs for all objects that have identical parameters (same values and same quantity). I am looking for a solution that is more efficient than two nested loops and an if.
Object structure in the dataset:

case class MergedProduct(id: String,
                   products: List[Product])

case class Product(productUrl: String, productId: String)

Example of data in dataset:

[  {
   "id": "ID1",
   "products": [
     {
       "product": {
         "productUrl": "SOMEURL",
         "productId": "1"
       }
     },
     {
       "product": {
         "productUrl": "SOMEOTHERURL",
         "productId": "1"
       }
     }
   ]
 },
 {
   "id": "ID2",
   "products": [
     {
       "product": {
         "productUrl": "SOMEURL",
         "productId": "1"
       }
     },
     {
       "product": {
         "productUrl": "SOMEOTHERURL",
         "productId": "1"
       }
     }
   ]
 },
 {
   "id": "ID3",
   "products": [
     {
       "product": {
         "productUrl": "DIFFERENTURL",
         "productId": "1"
       }
     },
     {
       "product": {
         "productUrl": "SOMEOTHERURL",
         "productId": "1"
       }
     }
   ]
 },
 {
   "id": "ID4",
   "products": [
     {
       "product": {
         "productUrl": "SOMEOTHERURL",
         "productId": "1"
       }
     },
     {
       "product": {
         "productUrl": "DIFFERENTURL",
         "productId": "1"
       }
     }
   ]
 },
 {
   "id": "ID5",
   "products": [
     {
       "product": {
         "productUrl": "NOTDUPLICATEDURL",
         "productId": "1"
       }
     },
     {
       "product": {
         "productUrl": "DIFFERENTURL",
         "productId": "1"
       }
     }
   ]
 }
]

In this example, four objects are duplicated (ID1 matches ID2, and ID3 matches ID4), so I would like to get their IDs in the corresponding lists.

The expected output is a List[List[String]]: List(List("ID1", "ID2"), List("ID3", "ID4")). I am looking for something efficient and readable, since the dataset has nearly 700 million objects.
The only goal is to log that the duplicates exist, and removing the listed duplicates from the dataset does not affect the database. So I was thinking of taking each MergedProduct in turn, searching for other MergedProducts with identical Products, getting their IDs, logging that they exist, removing those IDs from the dataset, and moving on to the next one until the whole dataset is checked. But in that case I would first have to collect the dataset as a list of MergedProducts and then do all the operations on it, which seems like a roundabout way.

CodePudding user response:

After trying some options and looking for neat solutions, I think this is kinda OK:

  private def getDuplicates(mergedProducts: List[MergedProduct]): List[List[String]] = {
    // Sort by (productId, productUrl), not productId alone, so product order does not matter.
    val duplicates = mergedProducts
      .groupBy(_.products.sortBy(p => (p.productId, p.productUrl)))
      .filter(_._2.size > 1).values.toList
    duplicates.map(group => group.map(_.id))
  }
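
The list-based function above needs the data collected on the driver first, which the question wanted to avoid at roughly 700 million objects. A minimal sketch of the same grouping done on the Dataset itself might look like the following. It assumes Spark SQL is on the classpath and that the Dataset holds the MergedProduct and Product case classes exactly as defined in the question. The names DuplicateIds and duplicateIdGroups are hypothetical.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.{col, collect_list, size, sort_array}

// Case classes as given in the question; they must be top-level for Spark encoders.
case class Product(productUrl: String, productId: String)
case class MergedProduct(id: String, products: List[Product])

object DuplicateIds {
  // Hypothetical helper: returns one list of ids per group of identical product lists.
  def duplicateIdGroups(ds: Dataset[MergedProduct]): Dataset[Seq[String]] = {
    import ds.sparkSession.implicits._
    ds.toDF()
      // sort_array orders the structs field by field (productUrl, then productId),
      // so lists that differ only in element order get the same grouping key.
      .groupBy(sort_array(col("products")).as("key"))
      // Gather the ids that share a key; their order inside a group is not guaranteed.
      .agg(collect_list(col("id")).as("ids"))
      // Keep only keys shared by more than one object.
      .where(size(col("ids")) > 1)
      .select(col("ids"))
      .as[Seq[String]]
  }
}
```

Because sort_array keeps repeated elements, two lists only match when they contain the same products the same number of times. The result is still a Dataset, so the groups can be logged or written out without collecting the full input; on the sample data it should yield one group with ID1 and ID2 and another with ID3 and ID4.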