Home > Blockchain >  Spark tuple to dataframe
Spark tuple to dataframe

Time:04-12

i have a tuple of 2 elements like this :

Tuple2(“Sting1, String2”, ArrayList(“String3”, “String4”))
=> 1st element is a string with comma separated string values
=> 2nd element is an arraylist that contains a list of string

i would like to have a data frame like this :

Col1        Col2        Col3
1           String1     String3
2           String1     String4
3           String2     String3
4           String2     String4

CodePudding user response:

TL;DR

import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}

df
    // `split` "String1, String2" into separate values, then create a row per value using `explode`
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    // create a row per value in the list: "String3", "String4"
    .withColumn("Col3", explode(col("_2")))
    // now that we have our 4 rows, add a new column with an incrementing number
    .withColumn("Col1", monotonically_increasing_id()   1)
    // only keep the columns we care about
    .select("Col1", "Col2", "Col3")
    .show(false)

Full Answer

Starting with your example:

val tuple2 = Tuple2("String1, String2", List("String3", "String4"))

and turning it into a DataFrame:

val df = List(tuple2).toDF("_1", "_2")

df.show(false)

which gives:

 ---------------- ------------------ 
|_1              |_2                |
 ---------------- ------------------ 
|String1, String2|[String3, String4]|
 ---------------- ------------------ 

Now we are ready for the transformation:

import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}

df
    // `split` "String1, String2" into separate values, then create a row per value using `explode`
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    // create a row per value in the list: "String3", "String4"
    .withColumn("Col3", explode(col("_2")))
    // now that we have our 4 rows, add a new column with an incrementing number
    .withColumn("Col1", monotonically_increasing_id()   1)
    // only keep the columns we care about
    .select("Col1", "Col2", "Col3")
    .show(false)

which gives:

 ---- ------- ------- 
|Col1|Col2   |Col3   |
 ---- ------- ------- 
|1   |String1|String3|
|2   |String1|String4|
|3   |String2|String3|
|4   |String2|String4|
 ---- ------- ------- 

Extra reading for further detail

It is worth noting that the order of the operations is key:

  1. First we explode "String1" and "String2" into their own rows:
df
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    .select("Col2")
    .show(false)

gives:

 ------- 
|Col2   |
 ------- 
|String1|
|String2|
 ------- 

where we go from 1 original row, to two.

  1. then we explode "String3", "String4":
df
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    .withColumn("Col3", explode(col("_2")))
    .select("Col2", "Col3")
    .show(false)

gives:

 ------- ------- 
|Col2   |Col3   |
 ------- ------- 
|String1|String3|
|String1|String4|
|String2|String3|
|String2|String4|
 ------- ------- 
  1. finally we add the incrementing count. If we did this earlier we copy the same number value to multiple rows.

for example:

df
    // here we add `Col1` to a Dataset of only one row! So we only have the value `1`
    .withColumn("Col1", monotonically_increasing_id()   1)
    // here we explode row 1, copying the value of `Col1`
    .withColumn("Col2", explode(split(col("_1"), ", ")))
    .withColumn("Col3", explode(col("_2")))
    .select("Col1", "Col2", "Col3")
    .show(false)

gives:

 ---- ------- ------- 
|Col1|Col2   |Col3   |
 ---- ------- ------- 
|1   |String1|String3|
|1   |String1|String4|
|1   |String2|String3|
|1   |String2|String4|
 ---- ------- ------- 
  • Related