Apply a transformation to all the columns with the same data type on Spark


I need to apply a transformation to all the Integer columns of my DataFrame before writing a CSV. The transformation consists of casting the value to String and then converting the format to the European one (e.g. 1234567 -> "1234567" -> "1.234.567").

Does Spark have any way to apply this transformation to all the Integer columns? I want it to be a generic piece of functionality (because I need to write multiple CSVs) rather than hardcoding the columns to transform for each DataFrame.

CodePudding user response:

DataFrame has a dtypes method, which returns the column names together with their data types as an Array[(String, String)], e.g. ("int_1", "IntegerType").

You can map over this array, applying a different expression to each column based on its data type, and then pass the resulting list to the select method:

import spark.implicits._
import org.apache.spark.sql.functions._

// Sample DataFrame with two Integer columns and one String column
val dataSeq = Seq(
    (1246984, 993922, "test_1"),
    ( 246984, 993922, "test_2"),
    ( 246984, 993922, "test_3"))

val df = dataSeq.toDF("int_1", "int_2", "str_3")
df.show
+-------+------+------+
|  int_1| int_2| str_3|
+-------+------+------+
|1246984|993922|test_1|
| 246984|993922|test_2|
| 246984|993922|test_3|
+-------+------+------+
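For this DataFrame, dtypes returns the name/type pairs described above (result shown as a comment):

df.dtypes
// Array((int_1,IntegerType), (int_2,IntegerType), (str_3,StringType))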

// Build one select expression per column: Integer columns are
// reformatted as European-style strings, all others pass through.
val columns =
  df.dtypes.map {
    case (c, "IntegerType") => regexp_replace(format_number(col(c), 0), ",", ".").as(c)
    case (c, _)             => col(c)
  }

val df2 = df.select(columns:_*)
df2.show
+---------+-------+------+
|    int_1|  int_2| str_3|
+---------+-------+------+
|1.246.984|993.922|test_1|
|  246.984|993.922|test_2|
|  246.984|993.922|test_3|
+---------+-------+------+
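
Since you need to write multiple CSVs, you can wrap the same logic in a small reusable helper that accepts any DataFrame. A minimal sketch (the name withEuropeanIntegers is illustrative, not a Spark API):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Reformats every IntegerType column as a European-style string
// ("," -> ".") and leaves all other columns unchanged.
def withEuropeanIntegers(df: DataFrame): DataFrame = {
  val columns = df.dtypes.map {
    case (c, "IntegerType") => regexp_replace(format_number(col(c), 0), ",", ".").as(c)
    case (c, _)             => col(c)
  }
  df.select(columns: _*)
}

You can then call it on each DataFrame right before writing, e.g. withEuropeanIntegers(df).write.option("header", "true").csv(path).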