Return unique element col value from the file using spark scala/python-CodePudding

Consider a file with above two columns and has products column with diff products in the store, I need to return only unique product which is in just one store and return the store name. I have tried below approach but looking for efficient solution. Thanks in advance.

store   products
walmart  eggs,cereals,milk
target   toys,eggs,cereals
costco   eggs,cereals,milk

val df1 = dataDF.select("prods").agg(collect_list("prods")).collect.toArray
df1(0).getSeq[String](0).toList.map(x => x.split(",")).flatten.groupBy((word: String) => word).mapValues(_.length).filter(x=> x._2==1 ).keys.head

=> this returns toys, then filter that respective store from df. But it doesn't seem efficient .

The expected output

target toys

CodePudding user response：

You could try this:

dataDf
  .withColumn("products", split($"products", ","))  // Parse as array
  .withColumn("product", explode($"products"))      // Explode into rows
  .groupBy($"product")
  .agg(collect_list($"store").as("stores"))         // Get list of stores as array
  .filter(size($"stores") === 1)                    // Where there's only one store selling
  .show