I would like to filter out from allData Dataframe those records which types appears in the wrongTypes DataFrame (it has like 2000 records). I convert the wrongTypes DF to list of Strings then in the filter check if record is in that list. Here is my code:
import org.apache.spark.sql.functions._
import spark.implicits._
val allData = Seq(
("id1", "X"),
("id2", "X"),
("id3", "Y"),
("id4", "A")
).toDF("id", "type")
val wrongTypes = Seq(
("X"),
("Y"),
("Z")
).toDF("type").select("type").map(r => r.getString(0)).collect.toList
allData.filter(col("type").isin(wrongTypes)).show()
and I get this error:
SparkRuntimeException: The feature is not supported: literal for 'List(X, Y, Z)' of class scala.collection.immutable.$colon$colon.
CodePudding user response:
allData.filter(col("type").isInCollection(wrongTypes)).show()
CodePudding user response:
If the wrongTypes
dataframe is large, you may want to not collect the results, and just use the dataframes.
I am not sure how to express it on the Dataframe level, but in sql terms you want to use the IN
clause, like this:
allData.createOrReplaceTempView("all_data")
wrongTypes.createOrReplaceTempView("wrong_types")
spark.sql("""
SELECT
*
FROM all_data
WHERE type in (SELECT type FROM wrong_types)
""").show()
CodePudding user response:
isin()
is function with variable number of arguments and if you want to pass collection to it you must use splat operator:
col("type").isin(wrongTypes:_*)
Other option is to use isInCollection()
which has Iterable
as argument. I suggest using it as you mentioned that you expect about 2K entries.