I have a DataFrame column that is currently a string containing multiple comma-separated double values (usually 2 or 3). Refer to the schema snapshot below.
Sample : "619.619620621622, 123.12412512699"
root
|-- MyCol: string (nullable = true)
I want to convert it to an array of doubles matching the schema below.
Desired : array<double>
[619.619620621622, 123.12412512699]
root
|-- MyCol: array (nullable = true)
| |-- element_value: double (containsNull = true)
I know how to do this on a single string value. Now I want to do it on the complete DataFrame column.
Is there any way this could be done with a one- or two-line solution?
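For reference, on a single string I can already do something like this in plain Scala:
"619.619620621622, 123.12412512699"
  .split(",")
  .map(_.trim.toDouble) // Array(619.619620621622, 123.12412512699)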
CodePudding user response:
Assuming the starting point:
val spark: SparkSession = ???
import spark.implicits._
val df: DataFrame = ???
here is a solution based on a UDF:
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._

def toDoubles: UserDefinedFunction =
  udf { string: String =>
    string
      .split(",")
      .map(_.trim) // based on your input you may need to trim the strings
      .map(_.toDouble)
  }
df
  .select(toDoubles($"MyCol") as "doubles")
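If you would rather keep the other columns and overwrite the column in place instead of selecting a new one, the same UDF works with withColumn (a minimal sketch, assuming the df above):
df.withColumn("MyCol", toDoubles($"MyCol"))
// MyCol is now array<double>; the remaining columns are unchanged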
CodePudding user response:
split followed by cast should do the job:
import org.apache.spark.sql.functions.{col, split}
val df = Seq("619.619620621622, 123.12412512699").toDF("MyCol")
val df2 = df.withColumn("myCol", split(col("MyCol"), ",").cast("array<double>"))
df2.printSchema
//root
// |-- myCol: array (nullable = true)
// | |-- element: double (containsNull = true)
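Note that the cast should take care of the leading space after the comma (string-to-double casts trim surrounding whitespace), so no explicit trim is needed here. A quick check (sketch):
df2.show(false)
// expected: [619.619620621622, 123.12412512699] in the myCol column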