I don't have the schema in advance, because I read from different tables and the DataFrame is built from their data. So I want to detect any LongType column and cast it to IntegerType.
My approach was to create a new DataFrame with a new schema in which the LongType fields are converted to IntegerType.
val df = spark.read.format("bigquery").load(sql)
// cast long types to int
val newSchemaArr = df.schema.fields.map { f =>
  if (f.dataType.isInstanceOf[LongType])
    StructField(name = f.name, dataType = IntegerType, nullable = f.nullable)
  else f
}
val newSchema = new StructType(newSchemaArr)
val df2 = spark.createDataFrame(df.rdd, newSchema)
// write to hdfs files
df2.write.format("avro").save(destinationPath)
But I get this error when writing the data:
Caused by: java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of int
Is there a way to fix this, or another approach to handle the problem?
Spark version: 3.2.0
Scala version: 2.12
CodePudding user response:
The error happens because spark.createDataFrame(df.rdd, newSchema) only relabels the schema: the rows still hold java.lang.Long values, which no longer match the declared IntegerType. The easiest way to do this is simply to cast the columns where necessary:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType, StructField}

// build a single projection that casts every LongType column to IntegerType
val columns = df.schema.map {
  case StructField(name, LongType, _, _) => col(name).cast(IntegerType)
  case f => col(f.name)
}

// write the casted columns
df.select(columns: _*).write.format("avro").save(destinationPath)
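To see the effect before running it against the real BigQuery data, you can try the same projection on a small in-memory DataFrame; the sample data and column names below are made up for illustration:

import spark.implicits._
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType, StructField}

// hypothetical sample: "id" is a long column, "label" is a string column
val sample = Seq((1L, "a"), (2L, "b")).toDF("id", "label")

val fixed = sample.select(sample.schema.map {
  case StructField(name, LongType, _, _) => col(name).cast(IntegerType)
  case f => col(f.name)
}: _*)

fixed.printSchema()   // "id" is now integer, "label" stays string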
CodePudding user response:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType}

// cast each LongType column to IntegerType, one withColumn call at a time
val sch = df.schema
val df2 = sch.fieldNames.foldLeft(df) { (tmpDF, colName) =>
  if (tmpDF.schema(colName).dataType == LongType)
    tmpDF.withColumn(colName, col(colName).cast(IntegerType))
  else tmpDF
}
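One caveat with either approach: narrowing a long to an int will not fail by default if a value does not fit into 32 bits (unless spark.sql.ansi.enabled is set), so it may be worth checking the value ranges first. A minimal sketch, assuming the df from the question:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// names of all LongType columns
val longCols = df.schema.fields.collect { case f if f.dataType == LongType => f.name }

// columns containing at least one value outside the Int range
val outOfRange = longCols.filter { name =>
  !df.filter(col(name) > Int.MaxValue || col(name) < Int.MinValue).isEmpty
}

if (outOfRange.nonEmpty)
  println(s"Values outside Int range in: ${outOfRange.mkString(", ")}")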