SCALA
I have a table with this struct:
FName | SName | Jan 2021 | Feb 2021 | Mar 2021 | Total 2021 | |
---|---|---|---|---|---|---|
Micheal | Scott | [email protected] | 4000 | 5000 | 3400 | 50660 |
Dwight | Schrute | [email protected] | 1200 | 6900 | 1000 | 35000 |
Kevin | Malone | [email protected] | 9000 | 6000 | 18000 | 32000 |
And i want to transform it to:
I tried with 'stack' method but i couldn't get it to work. Thanks
CodePudding user response:
You can flatten the monthly/total columns via explode
as shown below:
val df = Seq(
("Micheal", "Scott", "[email protected]", 4000, 5000, 3400, 50660),
("Dwight", "Schrute", "[email protected]", 1200, 6900, 1000, 35000),
("Kevin", "Malone", "[email protected]", 9000, 6000, 18000, 32000)
).toDF("FName","SName", "Email", "Jan 2021", "Feb 2021", "Mar 2021", "Total 2021")
val moYrCols = Array("Jan 2021", "Feb 2021", "Mar 2021", "Total 2021") // (**)
val otherCols = df.columns diff moYrCols
val structCols = moYrCols.map{ c =>
val moYr = split(lit(c), "\\s ")
struct(moYr(1).as("Year"), moYr(0).as("Month"), col(c).as("Value"))
}
df.
withColumn("flattened", explode(array(structCols: _*))).
select(otherCols.map(col) : $"flattened.*": _*).
show
/*
------- ------- ------------------ ---- ----- -----
| FName| SName| Email|Year|Month|Value|
------- ------- ------------------ ---- ----- -----
|Micheal| Scott| [email protected]|2021| Jan| 4000|
|Micheal| Scott| [email protected]|2021| Feb| 5000|
|Micheal| Scott| [email protected]|2021| Mar| 3400|
|Micheal| Scott| [email protected]|2021|Total|50660|
| Dwight|Schrute|[email protected]|2021| Jan| 1200|
| Dwight|Schrute|[email protected]|2021| Feb| 6900|
| Dwight|Schrute|[email protected]|2021| Mar| 1000|
| Dwight|Schrute|[email protected]|2021|Total|35000|
| Kevin| Malone| [email protected]|2021| Jan| 9000|
| Kevin| Malone| [email protected]|2021| Feb| 6000|
| Kevin| Malone| [email protected]|2021| Mar|18000|
| Kevin| Malone| [email protected]|2021|Total|32000|
------- ------- ------------------ ---- ----- -----
*/
(**) Use pattern matching in case there are many columns; for example:
val moYrCols = df.columns.filter(_.matches("[A-Za-z] \\s \\d{4}"))
CodePudding user response:
val data = Seq(
("Micheal","Scott","[email protected]",4000,5000,3400,50660),
("Dwight","Schrute","[email protected]",1200,6900,1000,35000),
("Kevin","Malone","[email protected]",9000,6000,18000,32000)) )
val columns = Seq("FName","SName","Email","Jan 2021","Feb 2021","Mar 2021","Total 2021")
val newColumns = Array( "FName", "SName", "Email","Total 2021" )
val df = spark.createDataFrame( data ).toDF(columns:_*)
df
.select(
struct(
(for {column <- df.columns } yield col(column)).toSeq :_*
).as("mystruct")) // create your data set with a column as a struct.
.select(
$"mystruct.Fname", // refer to sub element of struct with '.' operator
$"mystruct.sname",
$"mystruct.Email",
explode( /make rows for every entry in the array.
array(
(for {column <- df.columns if !(newColumns contains column) } //filter out the columns we already selected
yield // for each element yield the following expression (similar to map)
struct(
col(s"mystruct.$column").as("value"), // create the value column
lit(column).as("date_year")) // create a date column
).toSeq :_* ) // shorthand to pass scala array into varargs for array function
)
)
.select(
col("*"), // just being lazy instead of typing.
col("col.*") // create columns from new column. Seperating the year/date should be easy from here.
).drop($"col")
.show(false)
-------------- -------------- ------------------ ----- ---------
|mystruct.Fname|mystruct.sname|mystruct.Email |value|date_year|
-------------- -------------- ------------------ ----- ---------
|Micheal |Scott |[email protected] |4000 |Jan 2021 |
|Micheal |Scott |[email protected] |5000 |Feb 2021 |
|Micheal |Scott |[email protected] |3400 |Mar 2021 |
|Dwight |Schrute |[email protected]|1200 |Jan 2021 |
|Dwight |Schrute |[email protected]|6900 |Feb 2021 |
|Dwight |Schrute |[email protected]|1000 |Mar 2021 |
|Kevin |Malone |[email protected] |9000 |Jan 2021 |
|Kevin |Malone |[email protected] |6000 |Feb 2021 |
|Kevin |Malone |[email protected] |18000|Mar 2021 |
-------------- -------------- ------------------ ----- ---------