I have two JavaRDDs, A and B, and I want to keep only the longs that are in A but not in B. How should I do that? Thanks!
CodePudding user response:
I am posting a solution in Scala; it should be almost the same in Java.
Do a leftOuterJoin, which gives all the records in the first RDD along with the matching records from the second RDD, like WrappedArray((168,(def,None)), (192,(abc,Some(abc)))). To keep only the records that are present in the first RDD alone, filter on None.
val data = spark.sparkContext.parallelize(Seq((192, "abc"), (168, "def")))
val data2 = spark.sparkContext.parallelize(Seq((192, "abc")))
// Keep only the keys that found no match in data2 (the right side is None)
val result = data
  .leftOuterJoin(data2)
  .filter(record => record._2._2.isEmpty)
println(result.collect.toSeq)
Output> WrappedArray((168,(def,None)))
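Since the question is about JavaRDD, here is a rough Java sketch of the same idea. It assumes A and B are JavaRDD<Long> and a reasonably recent Spark (2.x or later, where the Java leftOuterJoin yields Spark's own Optional). The class name KeepOnlyInA, the method name viaLeftOuterJoin, and the sample values are made up for illustration. Because plain longs have no key, each value is keyed by itself before the join. The last line of main also shows subtract, which JavaRDD offers directly and which may be the simpler route if you do not need the join for anything else.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class KeepOnlyInA {

    // Hypothetical helper mirroring the Scala answer: key each long by itself,
    // left-outer-join A with B, and keep the keys with no match on the B side.
    static JavaRDD<Long> viaLeftOuterJoin(JavaRDD<Long> a, JavaRDD<Long> b) {
        JavaPairRDD<Long, Long> keyedA = a.mapToPair(x -> new Tuple2<Long, Long>(x, x));
        JavaPairRDD<Long, Long> keyedB = b.mapToPair(x -> new Tuple2<Long, Long>(x, x));
        return keyedA
                .leftOuterJoin(keyedB)                      // (key, (valueFromA, Optional valueFromB))
                .filter(rec -> !rec._2()._2().isPresent())  // no match in B
                .keys();
    }

    public static void main(String[] args) {
        // Local master and sample data are assumptions for this sketch.
        SparkConf conf = new SparkConf().setAppName("KeepOnlyInA").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Long> a = sc.parallelize(Arrays.asList(192L, 168L));
            JavaRDD<Long> b = sc.parallelize(Arrays.asList(192L));

            // Join-based version, as in the Scala code above.
            System.out.println(viaLeftOuterJoin(a, b).collect());

            // Built-in alternative: elements of a that are not in b.
            System.out.println(a.subtract(b).collect());
        }
    }
}
```

With the sample data above, both calls should leave only 168.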