How to change structure of RDD/Spark Dataframe?

Time:03-10

What is the easiest way to go from this kind of RDD/Spark DataFrame:

date   Tokyo  New York
01/01      1         2
02/01      3         2
03/01      4         5

to the same data in this form below?

city      date   value
Tokyo     01/01      1
New York  01/01      2
Tokyo     02/01      3
New York  02/01      2
Tokyo     03/01      4
New York  03/01      5

CodePudding user response:

I would solve this with the PySpark SQL functions create_map and explode, as below:

from pyspark.sql import functions as func

# Build a map column {city name -> value} from the two wide columns
df1 = df.withColumn(
    'mapCol',
    func.create_map(
        func.lit('Tokyo'), df['Tokyo'],
        func.lit('New York'), df['New York'],
    ),
)

# explode emits one (city, value) row per map entry;
# then drop the wide columns and the helper map column
res = (
    df1.select('*', func.explode(df1.mapCol).alias('city', 'value'))
       .drop('Tokyo', 'New York', 'mapCol')
)
res.show()

Output :

+-----+--------+-----+
| date|    city|value|
+-----+--------+-----+
|01/01|   Tokyo|    1|
|01/01|New York|    2|
|02/01|   Tokyo|    3|
|02/01|New York|    2|
|03/01|   Tokyo|    4|
|03/01|New York|    5|
+-----+--------+-----+
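Conceptually, create_map plus explode emits one output row per map entry, so each wide row fans out into one row per city. The same reshaping can be sketched in plain Python (no Spark session needed) to see the shape change on the sample data:

```python
# The wide-format sample rows from the question
rows = [
    {"date": "01/01", "Tokyo": 1, "New York": 2},
    {"date": "02/01", "Tokyo": 3, "New York": 2},
    {"date": "03/01", "Tokyo": 4, "New York": 5},
]

# Fan each wide row out into one (date, city, value) row per city,
# mirroring what create_map + explode does inside Spark
long_rows = [
    {"date": r["date"], "city": city, "value": r[city]}
    for r in rows
    for city in ("Tokyo", "New York")
]

for r in long_rows:
    print(r)
```

Three wide rows become six long rows, matching the output above.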

CodePudding user response:

There is a much simpler solution, using stack.

Spark SQL

with t (date, Tokyo, `New York`) as (
  select stack(3,
    '01/01', 1, 2,
    '02/01', 3, 2,
    '03/01', 4, 5)
)

-- The solution starts here
select date,
       stack(2, 'Tokyo', Tokyo, 'New York', `New York`) as (city, value)
from   t
date   city      value
01/01  Tokyo         1
01/01  New York      2
02/01  Tokyo         3
02/01  New York      2
03/01  Tokyo         4
03/01  New York      5

PySpark

df = spark.sql("select stack(3, '01/01',1,2, '02/01',3,2, '03/01',4,5) as (date, Tokyo, `New York`)")

# The solution starts here
df.selectExpr("date", "stack(2, 'Tokyo', Tokyo, 'New York', `New York`) as (city, value)").show()

+-----+--------+-----+
| date|    city|value|
+-----+--------+-----+
|01/01|   Tokyo|    1|
|01/01|New York|    2|
|02/01|   Tokyo|    3|
|02/01|New York|    2|
|03/01|   Tokyo|    4|
|03/01|New York|    5|
+-----+--------+-----+
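With many city columns, writing the stack() arguments by hand gets tedious. A small helper (hypothetical, not part of Spark) can build the expression string from a column list; it only does string formatting, so it runs without Spark:

```python
def stack_expr(cities):
    """Build a Spark SQL stack() expression that unpivots the given
    city columns into (city, value) rows."""
    # Pair each city name literal with its backquoted column reference:
    # 'Tokyo', `Tokyo`, 'New York', `New York`, ...
    pairs = ", ".join(f"'{c}', `{c}`" for c in cities)
    return f"stack({len(cities)}, {pairs}) as (city, value)"

print(stack_expr(["Tokyo", "New York"]))
# stack(2, 'Tokyo', `Tokyo`, 'New York', `New York`) as (city, value)
```

You would then unpivot with something like df.selectExpr("date", stack_expr(cities)), where cities is the list of wide columns.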