The feature is not supported error during filtering out data from DataFrame based on other DataFrame-CodePudding

I would like to filter out from allData Dataframe those records which types appears in the wrongTypes DataFrame (it has like 2000 records). I convert the wrongTypes DF to list of Strings then in the filter check if record is in that list. Here is my code:

import org.apache.spark.sql.functions._
import spark.implicits._

val allData = Seq(
  ("id1", "X"),
  ("id2", "X"),
  ("id3", "Y"),
  ("id4", "A")
).toDF("id", "type")

val wrongTypes = Seq(
  ("X"),
  ("Y"),
  ("Z")
  ).toDF("type").select("type").map(r => r.getString(0)).collect.toList

allData.filter(col("type").isin(wrongTypes)).show()

and I get this error:

SparkRuntimeException: The feature is not supported: literal for 'List(X, Y, Z)' of class scala.collection.immutable.$colon$colon.

CodePudding user response：

allData.filter(col("type").isInCollection(wrongTypes)).show()

CodePudding user response：

If the wrongTypes dataframe is large, you may want to not collect the results, and just use the dataframes.

I am not sure how to express it on the Dataframe level, but in sql terms you want to use the IN clause, like this:

allData.createOrReplaceTempView("all_data")
wrongTypes.createOrReplaceTempView("wrong_types")

spark.sql("""
    SELECT 
      * 
    FROM all_data
    WHERE type in (SELECT type FROM wrong_types)
""").show()

CodePudding user response：

isin() is function with variable number of arguments and if you want to pass collection to it you must use splat operator:

col("type").isin(wrongTypes:_*)

Other option is to use isInCollection() which has Iterable as argument. I suggest using it as you mentioned that you expect about 2K entries.