Find elements in one RDD but not in the other RDD

Time:11-08

I have two JavaRDDs of longs, A and B. I want to keep only the longs that are in A but not in B. How should I do that? Thanks!

CodePudding user response:

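I am posting a solution in Scala; it should be almost the same in Java.

Do a leftOuterJoin, which returns every record of the first RDD along with the matching record from the second RDD, if there is one. Note that leftOuterJoin is defined on key-value RDDs, so the example uses pairs. The joined result looks like WrappedArray((168,(def,None)), (192,(abc,Some(abc)))). To keep only the records that are present in the first RDD alone, filter for the entries whose right-hand side is None.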

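// Assumes an existing SparkSession named `spark` (for example in spark-shell).
val data = spark.sparkContext.parallelize(Seq((192, "abc"), (168, "def")))
val data2 = spark.sparkContext.parallelize(Seq((192, "abc")))

// Keep only the records of `data` whose key has no match in `data2`.
val result = data
  .leftOuterJoin(data2)
  .filter { case (_, (_, right)) => right.isEmpty }

println(result.collect.toSeq)
Output> WrappedArray((168,(def,None)))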
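
Since the question is about plain longs rather than key-value pairs, here is a minimal sketch of how the same idea might be applied to an RDD of longs. The object name OnlyInA, the helper names viaSubtract and viaJoin, and the sample values are made up for illustration. The first helper uses the RDD's built-in subtract method (also available on JavaRDD); the second keys each long with a dummy value so that the leftOuterJoin approach above can be reused.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of Spark.
object OnlyInA {
  // Option 1: subtract returns the elements of `a` that do not occur in `b`.
  def viaSubtract(a: RDD[Long], b: RDD[Long]): RDD[Long] =
    a.subtract(b)

  // Option 2: turn each long into a (key, dummy) pair, left-outer-join,
  // and keep the keys whose right-hand side is empty (no match in `b`).
  def viaJoin(a: RDD[Long], b: RDD[Long]): RDD[Long] =
    a.map(x => (x, 1))
      .leftOuterJoin(b.map(x => (x, 1)))
      .filter { case (_, (_, right)) => right.isEmpty }
      .keys

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only; sample data is assumed.
    val spark = SparkSession.builder()
      .appName("only-in-a")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val a = sc.parallelize(Seq(1L, 2L, 3L, 4L))
    val b = sc.parallelize(Seq(2L, 4L, 5L))

    // Both approaches should yield the elements 1 and 3.
    println(viaSubtract(a, b).collect().sorted.toSeq)
    println(viaJoin(a, b).collect().sorted.toSeq)

    spark.stop()
  }
}
```

For this use case subtract is probably the shorter route; the join variant is mainly useful when the records already carry keys and values that need to be kept.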