What is the easiest way to go from this kind of RDD/Spark DataFrame:

| date | Tokyo | New York |
|---|---|---|
| 01/01 | 1 | 2 |
| 02/01 | 3 | 2 |
| 03/01 | 4 | 5 |

to the same data in the form below?

| city | date | value |
|---|---|---|
| Tokyo | 01/01 | 1 |
| New York | 01/01 | 2 |
| Tokyo | 02/01 | 3 |
| New York | 02/01 | 2 |
| Tokyo | 03/01 | 4 |
| New York | 03/01 | 5 |
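For anyone who wants to try the answers below, here is a minimal sketch that builds the sample wide DataFrame (it assumes an active SparkSession named spark; the column names and rows are just the ones from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Wide format: one row per date, one column per city
df = spark.createDataFrame(
    [("01/01", 1, 2), ("02/01", 3, 2), ("03/01", 4, 5)],
    ["date", "Tokyo", "New York"],
)
df.show()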
CodePudding user response:
I would solve this with the PySpark SQL functions create_map and explode: create_map collects the city columns into a single map column, and explode turns each map entry into its own row, as below:
from pyspark.sql import functions as func

# Collect the city columns into a single map column: {'Tokyo': value, 'New York': value}
df1 = df.withColumn('mapCol',
                    func.create_map(func.lit('Tokyo'), df.Tokyo,
                                    func.lit('New York'), df["New York"]))

# Exploding a map yields one row per entry; alias the key/value columns as city/value
res = df1.select('*', func.explode(df1.mapCol).alias('city', 'value')) \
         .drop("Tokyo", "New York", "mapCol")
res.show()
Output:

+-----+--------+-----+
| date|    city|value|
+-----+--------+-----+
|01/01|   Tokyo|    1|
|01/01|New York|    2|
|02/01|   Tokyo|    3|
|02/01|New York|    2|
|03/01|   Tokyo|    4|
|03/01|New York|    5|
+-----+--------+-----+
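If there are many city columns and you don't want to list them by hand, the same map can be built programmatically from df.columns. A sketch, under the assumption that every column other than date is a city column:

from itertools import chain
from pyspark.sql import functions as func

# Assumption: every column except 'date' holds values for one city
city_cols = [c for c in df.columns if c != "date"]

# Interleave literal column names with column values: lit(name1), col1, lit(name2), col2, ...
map_col = func.create_map(*chain.from_iterable(
    (func.lit(c), func.col(c)) for c in city_cols
))

res = (df.withColumn("mapCol", map_col)
         .select("date", func.explode("mapCol").alias("city", "value")))
res.show()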
CodePudding user response:
There is a much simpler solution, using stack: stack(n, expr1, ..., exprk) separates expr1 through exprk into n rows, which is exactly the wide-to-long reshape asked for here.

Spark SQL
with t (date, Tokyo, `New York`) as (
    select stack(3, '01/01', 1, 2, '02/01', 3, 2, '03/01', 4, 5)
)
-- The solution starts here
select date, stack(2, 'Tokyo', Tokyo, 'New York', `New York`) as (city, value)
from t
| date | city | value |
|---|---|---|
| 01/01 | Tokyo | 1 |
| 01/01 | New York | 2 |
| 02/01 | Tokyo | 3 |
| 02/01 | New York | 2 |
| 03/01 | Tokyo | 4 |
| 03/01 | New York | 5 |
PySpark
df = spark.sql("select stack(3, '01/01', 1, 2, '02/01', 3, 2, '03/01', 4, 5) as (date, Tokyo, `New York`)")

# The solution starts here
df.selectExpr("date", "stack(2, 'Tokyo', Tokyo, 'New York', `New York`) as (city, value)").show()
+-----+--------+-----+
| date|    city|value|
+-----+--------+-----+
|01/01|   Tokyo|    1|
|01/01|New York|    2|
|02/01|   Tokyo|    3|
|02/01|New York|    2|
|03/01|   Tokyo|    4|
|03/01|New York|    5|
+-----+--------+-----+
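For completeness: on Spark 3.4 or later the DataFrame API also offers unpivot (with melt as an alias), which performs the same reshape without writing the stack expression by hand. A sketch, assuming the wide df from the question:

# Requires Spark 3.4+: keep 'date' as the id column and melt the two city columns
long_df = df.unpivot(
    ids=["date"],
    values=["Tokyo", "New York"],
    variableColumnName="city",
    valueColumnName="value",
)
long_df.show()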