I'm trying to create an empty DataFrame with a schema:
val sparkConf = new SparkConf()
  .setAppName("app")
  .setMaster("local")
val sparkSession = SparkSession
  .builder()
  .config(sparkConf)
  .getOrCreate()
val sparkContext = sparkSession.sparkContext

var tmpScheme = StructType(
  StructField("source_id", StringType, true) :: Nil)
var df = sparkSession.createDataFrame(tmpScheme)
and got the error: Schema for type org.apache.spark.sql.types.DataType is not supported ...
I don't understand why, since there is no DataType anywhere in my code, not even in the imports:
import org.apache.spark.sql.types.{BooleanType, IntegerType, StringType, StructField, StructType}
What could the problem be here?
PS: Spark version:
"org.apache.spark" %% "spark-sql" % "3.2.2", // spark
"org.apache.spark" %% "spark-core" % "3.2.2", // spark
CodePudding user response:
If you check the documentation, you can see that the fields argument of StructType is of type Array[StructField], while you are passing a single StructField. This means that you should wrap your StructField in an Array, for example:
val simpleSchema = StructType(Array(
  StructField("source_id", StringType, true)
))
Good luck!
EDIT
The case with one parameter in createDataFrame:
case class Data(number: Int, word: String) // case class matching the columns in the output below

val data = Seq(
  Data(1, "test"),
  Data(2, "test2")
)
val dataDf = spark.createDataFrame(data)
dataDf.show(10, false)
The case with two parameters in createDataFrame:
import org.apache.spark.sql.Row

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)
val someData = Seq(Row(1, "test"), Row(2, "test2"))
val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
The output for both cases is the same:
+------+-----+
|number|word |
+------+-----+
|1     |test |
|2     |test2|
+------+-----+
In your case, you are calling the one-parameter version of createDataFrame, so Spark tries to infer a schema from the attributes of the class you pass in (StructType, which is a sequence of StructField) and treats your source_id StructField as a row of data. A StructField carries a dataType attribute of type DataType (and StructType itself extends DataType), and Spark cannot resolve a schema for DataType; that is where your error comes from.