Home > Net >  Schema for type org.apache.spark.sql.types.DataType is not supported
Schema for type org.apache.spark.sql.types.DataType is not supported

Time:12-27

I try to create empty df with schema:

  val sparkConf = new SparkConf()
    .setAppName("app")
    .setMaster("local")

  val sparkSession = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()

  val sparkContext = sparkSession.sparkContext

  var tmpScheme = StructType(
    StructField("source_id", StringType, true) :: Nil)

var df = conf.SparkConf.sparkSession.createDataFrame(tmpScheme)

and got Schema for type org.apache.spark.sql.types.DataType is not supported ...

I don't understand why - there is no .DataType even in Imports:

import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType, StructField, StructType}

What can be the problem here?

PS: spark version

  "org.apache.spark" %% "spark-sql" % "3.2.2", // spark
  "org.apache.spark" %% "spark-core" % "3.2.2", // spark

CodePudding user response:

If you check the documentation, you can see that the argument fields of StructType is of type Array[StructField] and you are passing StructField.

This means that you should wrap your StructField with Array, for example:

val simpleSchema = StructType(Array(
  StructField("source_id", StringType, true))
)

Good luck!

EDIT

The case with one parameter in createDataframe:

val data = Seq(
  Data(1, "test"),
  Data(2, "test2")
)
val dataDf = spark.createDataFrame(data)
dataDf.show(10, false)

The case with two parameterse in createDataframe:

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)
val someData = Seq(Row(1, "test"), Row(2, "test2"))
val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)

The output for both cases is the same:

 ------ ----- 
|number|word |
 ------ ----- 
|1     |test |
|2     |test2|
 ------ ----- 

In your case, the schema is trying to be inferred from attributes of the class (StructType) and is trying to be populated with StructField: source_id. StructType extends DataType and that is where your error comes from (Spark can not resolve the type)

  • Related