I need to apply a transformation to all the Integer columns of my DataFrame before writing a CSV. The transformation consists of casting the type to String and then changing the format to the European one (e.g. 1234567 -> "1234567" -> "1.234.567").
Does Spark have any way to apply this transformation to all the Integer columns? I want it to be generic (because I need to write multiple CSVs) instead of hardcoding the columns to transform for each DataFrame.
CodePudding user response:
DataFrame has a dtypes method, which returns the column names along with their data types as an Array[(String, String)]: one ("column name", "data type") pair per column.
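For example, for the df built below, df.dtypes returns:

Array((int_1,IntegerType), (int_2,IntegerType), (str_3,StringType))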
You can map over this array, applying a different expression to each column based on its data type, and then pass the mapped list to the select method:
import spark.implicits._
import org.apache.spark.sql.functions._
val dataSeq = Seq(
  (1246984, 993922, "test_1"),
  (246984, 993922, "test_2"),
  (246984, 993922, "test_3"))
val df = dataSeq.toDF("int_1", "int_2", "str_3")
df.show
+-------+------+------+
|  int_1| int_2| str_3|
+-------+------+------+
|1246984|993922|test_1|
| 246984|993922|test_2|
| 246984|993922|test_3|
+-------+------+------+
val columns = df.dtypes.map {
  // Integer columns: add thousands separators, then swap "," for "."
  case (c, "IntegerType") => regexp_replace(format_number(col(c), 0), ",", ".").as(c)
  // Leave all other columns untouched
  case (c, _) => col(c)
}
val df2 = df.select(columns: _*)
df2.show
+---------+-------+------+
|    int_1|  int_2| str_3|
+---------+-------+------+
|1.246.984|993.922|test_1|
|  246.984|993.922|test_2|
|  246.984|993.922|test_3|
+---------+-------+------+
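Since you need this for several CSVs, you can wrap the mapping in a small helper and call it right before each write. A minimal sketch (the name withEuropeanInts, the header option, and the output path are just placeholders):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Convert every IntegerType column to a dot-separated String column; leave the rest as-is
def withEuropeanInts(df: DataFrame): DataFrame =
  df.select(df.dtypes.map {
    case (c, "IntegerType") => regexp_replace(format_number(col(c), 0), ",", ".").as(c)
    case (c, _)             => col(c)
  }: _*)

// Before each CSV write (path is a placeholder):
withEuropeanInts(df).write.option("header", "true").csv("/tmp/output_csv")

If some of your DataFrames also contain LongType columns, a matching case for "LongType" can be added the same way.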