"The feature is not supported" error when filtering a DataFrame based on another DataFrame

Time:10-20

I would like to filter out of the allData DataFrame those records whose type appears in the wrongTypes DataFrame (it has about 2,000 records). I convert the wrongTypes DataFrame to a list of strings and then, in the filter, check whether each record's type is in that list. Here is my code:

import org.apache.spark.sql.functions._
import spark.implicits._

val allData = Seq(
  ("id1", "X"),
  ("id2", "X"),
  ("id3", "Y"),
  ("id4", "A")
).toDF("id", "type")

val wrongTypes = Seq(
  ("X"),
  ("Y"),
  ("Z")
  ).toDF("type").select("type").map(r => r.getString(0)).collect.toList

allData.filter(col("type").isin(wrongTypes)).show()

and I get this error:

SparkRuntimeException: The feature is not supported: literal for 'List(X, Y, Z)' of class scala.collection.immutable.$colon$colon.

CodePudding user response:

allData.filter(col("type").isInCollection(wrongTypes)).show()

CodePudding user response:

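If the wrongTypes DataFrame is large, you may prefer not to collect it at all and to work with the DataFrames directly. This assumes wrongTypes is kept as a DataFrame, i.e. without the map/collect/toList step from the question.

I am not sure how to express it at the DataFrame level, but in SQL terms you want an IN clause with a subquery, like this: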

allData.createOrReplaceTempView("all_data")
wrongTypes.createOrReplaceTempView("wrong_types")

spark.sql("""
    SELECT 
      * 
    FROM all_data
    WHERE type in (SELECT type FROM wrong_types)
""").show()

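At the DataFrame level, an IN subquery like the one above can be written as a left semi join. The following is a minimal sketch rather than a definitive implementation. It assumes a local SparkSession and that the wrong types are kept as a DataFrame; the name wrongTypesDf is hypothetical and does not appear in the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("semi-join-sketch")
  .getOrCreate()
import spark.implicits._

val allData = Seq(
  ("id1", "X"),
  ("id2", "X"),
  ("id3", "Y"),
  ("id4", "A")
).toDF("id", "type")

// Hypothetical name: the wrong types kept as a DataFrame, not collected to a List.
val wrongTypesDf = Seq("X", "Y", "Z").toDF("type")

// left_semi keeps the rows of allData that have a match in wrongTypesDf.
val kept = allData.join(wrongTypesDf, allData("type") === wrongTypesDf("type"), "left_semi")

// left_anti keeps only the rows of allData that have no match in wrongTypesDf.
val dropped = allData.join(wrongTypesDf, allData("type") === wrongTypesDf("type"), "left_anti")

kept.show()    // rows id1, id2, id3
dropped.show() // row id4
```

Which of the two joins is wanted depends on whether the matching records should be kept or removed; neither one collects wrongTypesDf to the driver.
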
CodePudding user response:

isin() is a function with a variable number of arguments, so if you want to pass a collection to it you must use the splat operator:

col("type").isin(wrongTypes:_*)

Another option is to use isInCollection(), which takes an Iterable as its argument. I suggest using it, since you mentioned that you expect about 2K entries.
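
To illustrate why the call in the question fails, here is a plain Scala sketch (no Spark needed) of how a varargs method treats a List with and without the splat. The describe function is hypothetical; it only stands in for a method declared with a parameter like isin's (list: Any*).

```scala
// Hypothetical stand-in for a varargs method such as isin(list: Any*).
def describe(values: Any*): String =
  s"${values.length} argument(s): ${values.mkString(", ")}"

val wrongTypes = List("X", "Y", "Z")

// Without the splat, the whole List is passed as one single argument.
val asOne = describe(wrongTypes)

// With `: _*`, each element of the List becomes its own argument.
val expanded = describe(wrongTypes: _*)

println(asOne)    // 1 argument(s): List(X, Y, Z)
println(expanded) // 3 argument(s): X, Y, Z
```

In the first call the method receives the List itself as a single value, which matches the error message in the question: Spark tries to build a literal for 'List(X, Y, Z)' and cannot.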
