I have a tuple of 2 elements like this:
Tuple2("String1, String2", ArrayList("String3", "String4"))
=> the 1st element is a string of comma-separated values
=> the 2nd element is an ArrayList of strings
I would like to have a DataFrame like this:
Col1 Col2    Col3
1    String1 String3
2    String1 String4
3    String2 String3
4    String2 String4
Answer:
TL;DR
import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}
df
// `split` "String1, String2" into separate values, then create a row per value using `explode`
.withColumn("Col2", explode(split(col("_1"), ", ")))
// create a row per value in the list: "String3", "String4"
.withColumn("Col3", explode(col("_2")))
// now that we have our 4 rows, add a new column with an incrementing number
.withColumn("Col1", monotonically_increasing_id() 1)
// only keep the columns we care about
.select("Col1", "Col2", "Col3")
.show(false)
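The snippet above assumes a SparkSession is already available and its implicits are imported. A minimal local setup sketch (the `spark` name, app name and `local[*]` master are conventional assumptions, adjust to your environment):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tuple-to-dataframe") // assumed app name, pick anything
  .master("local[*]")            // assumed local execution
  .getOrCreate()

// needed for `.toDF(...)` on Scala collections
import spark.implicits._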
Full Answer
Starting with your example:
val tuple2 = Tuple2("String1, String2", List("String3", "String4"))
and turning it into a DataFrame:
val df = List(tuple2).toDF("_1", "_2")
df.show(false)
which gives:
+----------------+------------------+
|_1              |_2                |
+----------------+------------------+
|String1, String2|[String3, String4]|
+----------------+------------------+
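If you want to confirm that the second column is an array type (which `explode` requires), a quick check:
df.printSchema()
This should show `_1` as a plain `string` and `_2` as an `array` whose elements are `string`.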
Now we are ready for the transformation:
import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}
df
// `split` "String1, String2" into separate values, then create a row per value using `explode`
.withColumn("Col2", explode(split(col("_1"), ", ")))
// create a row per value in the list: "String3", "String4"
.withColumn("Col3", explode(col("_2")))
// now that we have our 4 rows, add a new column with an incrementing number
.withColumn("Col1", monotonically_increasing_id() 1)
// only keep the columns we care about
.select("Col1", "Col2", "Col3")
.show(false)
which gives:
+----+-------+-------+
|Col1|Col2   |Col3   |
+----+-------+-------+
|1   |String1|String3|
|2   |String1|String4|
|3   |String2|String3|
|4   |String2|String4|
+----+-------+-------+
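A caveat worth noting: `monotonically_increasing_id` only guarantees unique, increasing values, not consecutive ones, so on a DataFrame with more than one partition `Col1` can contain gaps. If you need a strict 1..n sequence, here is a sketch using `row_number` over a window (ordering by `Col2` and `Col3` is an assumed choice; pick whatever ordering suits your data):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, explode, row_number, split}

df
  .withColumn("Col2", explode(split(col("_1"), ", ")))
  .withColumn("Col3", explode(col("_2")))
  // row_number assigns consecutive numbers within the (single, unpartitioned) window
  .withColumn("Col1", row_number().over(Window.orderBy("Col2", "Col3")))
  .select("Col1", "Col2", "Col3")
  .show(false)
Note that a window without `partitionBy` moves all rows to a single partition, which is fine for small data like this but worth knowing about on larger sets.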
Extra reading for further detail
It is worth noting that the order of the operations is key:
- First we explode "String1" and "String2" into their own rows:
df
.withColumn("Col2", explode(split(col("_1"), ", ")))
.select("Col2")
.show(false)
gives:
+-------+
|Col2   |
+-------+
|String1|
|String2|
+-------+
where we go from one original row to two.
- Then we explode "String3" and "String4":
df
.withColumn("Col2", explode(split(col("_1"), ", ")))
.withColumn("Col3", explode(col("_2")))
.select("Col2", "Col3")
.show(false)
gives:
+-------+-------+
|Col2   |Col3   |
+-------+-------+
|String1|String3|
|String1|String4|
|String2|String3|
|String2|String4|
+-------+-------+
- Finally we add the incrementing count. If we did this earlier, the same number would be copied to multiple rows, for example:
df
// here we add `Col1` to a Dataset of only one row! So we only have the value `1`
.withColumn("Col1", monotonically_increasing_id() 1)
// here we explode row 1, copying the value of `Col1`
.withColumn("Col2", explode(split(col("_1"), ", ")))
.withColumn("Col3", explode(col("_2")))
.select("Col1", "Col2", "Col3")
.show(false)
gives:
+----+-------+-------+
|Col1|Col2   |Col3   |
+----+-------+-------+
|1   |String1|String3|
|1   |String1|String4|
|1   |String2|String3|
|1   |String2|String4|
+----+-------+-------+
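For comparison, the same cross product can be built in plain Scala before touching the DataFrame API; a sketch that reuses the `tuple2` value defined above and assumes `spark.implicits._` is in scope:
val (csv, list) = tuple2

// cross product of the comma-separated values with the list elements
val rows = for {
  left  <- csv.split(", ").toList
  right <- list
} yield (left, right)

// zipWithIndex gives a guaranteed consecutive 0-based index; shift it to start at 1
val df2 = rows.zipWithIndex
  .map { case ((left, right), i) => (i + 1, left, right) }
  .toDF("Col1", "Col2", "Col3")

df2.show(false)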