Spark (Scala) update DataFrame


I create the DataFrame with a schema in the following way:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val rdd = sc.parallelize(
  Seq(
    Row("first", 2.0),
    Row("test", 1.5),
    Row("choose", 8.0)
  )
)

val schema: StructType = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("val1", DoubleType, true))

val dfWithSchema = spark.createDataFrame(rdd, schema)

And I want to update the id column with an arbitrary value.

I tried this:

dfWithSchema.withColumn("id", col("id"). (Random.nextString(10)))

But this does not compile and does not produce the expected result. Is there any way to do this with Spark on Scala 2.13?

CodePudding user response:

You can concatenate a value onto the existing id with Spark's concat function:

dfWithSchema.withColumn("id", concat(col("id"), lit(Random.nextString(10)))).show()
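One caveat that may not be obvious: lit(Random.nextString(10)) is evaluated once on the driver, so every row receives the same random suffix. If a different suffix per row is wanted, here is a minimal sketch using a nondeterministic UDF (randomSuffix is just an illustrative name):

import org.apache.spark.sql.functions.{col, concat, udf}
import scala.util.Random

// Marking the UDF nondeterministic stops Spark from folding it into a
// constant, so it is re-evaluated for each row.
val randomSuffix = udf(() => Random.nextString(10)).asNondeterministic()

dfWithSchema.withColumn("id", concat(col("id"), randomSuffix())).show()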

CodePudding user response:

I found the following solution:

dfWithSchema.withColumn("id", when(col("id").isNotNull, Random.nextString(10)))

However, I am surprised that there is no direct way to update a DataFrame with new column values, only by a condition on the existing column values.
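For what it's worth, no condition is required at all: DataFrames are immutable, and withColumn simply returns a new DataFrame with the given column replaced. A minimal sketch of the unconditional version (as above, Random.nextString(10) runs once on the driver, so every row gets the same value):

import org.apache.spark.sql.functions.lit
import scala.util.Random

dfWithSchema.withColumn("id", lit(Random.nextString(10))).show()

Note also that the when version above leaves null ids as null, since there is no .otherwise clause; the unconditional lit version overwrites nulls as well.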
