Consider a file with above two columns and has products column with diff products in the store, I need to return only unique product which is in just one store and return the store name. I have tried below approach but looking for efficient solution. Thanks in advance.
store products
walmart eggs,cereals,milk
target toys,eggs,cereals
costco eggs,cereals,milk
val df1 = dataDF.select("prods").agg(collect_list("prods")).collect.toArray
df1(0).getSeq[String](0).toList.map(x => x.split(",")).flatten.groupBy((word: String) => word).mapValues(_.length).filter(x=> x._2==1 ).keys.head
=> this returns toys, then filter that respective store from df. But it doesn't seem efficient .
The expected output
target toys
CodePudding user response:
You could try this:
dataDf
.withColumn("products", split($"products", ",")) // Parse as array
.withColumn("product", explode($"products")) // Explode into rows
.groupBy($"product")
.agg(collect_list($"store").as("stores")) // Get list of stores as array
.filter(size($"stores") === 1) // Where there's only one store selling
.show