What is the easiest way to go from this kind of RDD/Spark DataFrame:

| date | Tokyo | New York |
|---|---|---|
| 01/01 | 1 | 2 |
| 02/01 | 3 | 2 |
| 03/01 | 4 | 5 |

to the same data in the form below?

| city | date | value |
|---|---|---|
| Tokyo | 01/01 | 1 |
| New York | 01/01 | 2 |
| Tokyo | 02/01 | 3 |
| New York | 02/01 | 2 |
| Tokyo | 03/01 | 4 |
| New York | 03/01 | 5 |
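For anyone who wants to try the answers below, here is a minimal sketch that builds the sample wide DataFrame (it assumes an active SparkSession named spark; the column names and rows are just the ones from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Wide format: one row per date, one column per city
df = spark.createDataFrame(
    [("01/01", 1, 2), ("02/01", 3, 2), ("03/01", 4, 5)],
    ["date", "Tokyo", "New York"],
)
df.show()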
CodePudding user response:
I would solve this with the PySpark SQL functions create_map and explode: create_map collects the city columns into a single map column, and explode turns each map entry into its own row, as below:
from pyspark.sql import functions as func

# Collect the city columns into a single map column: {'Tokyo': value, 'New York': value}
df1 = df.withColumn('mapCol',
                    func.create_map(func.lit('Tokyo'), df.Tokyo,
                                    func.lit('New York'), df["New York"]))

# Exploding a map yields one row per entry; alias the key/value columns as city/value
res = df1.select('*', func.explode(df1.mapCol).alias('city', 'value')) \
         .drop("Tokyo", "New York", "mapCol")
res.show()
Output:

+-----+--------+-----+
| date|    city|value|
+-----+--------+-----+
|01/01|   Tokyo|    1|
|01/01|New York|    2|
|02/01|   Tokyo|    3|
|02/01|New York|    2|
|03/01|   Tokyo|    4|
|03/01|New York|    5|
+-----+--------+-----+
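If there are many city columns and you don't want to list them by hand, the same map can be built programmatically from df.columns. A sketch, under the assumption that every column other than date is a city column:

from itertools import chain
from pyspark.sql import functions as func

# Assumption: every column except 'date' holds values for one city
city_cols = [c for c in df.columns if c != "date"]

# Interleave literal column names with column values: lit(name1), col1, lit(name2), col2, ...
map_col = func.create_map(*chain.from_iterable(
    (func.lit(c), func.col(c)) for c in city_cols
))

res = (df.withColumn("mapCol", map_col)
         .select("date", func.explode("mapCol").alias("city", "value")))
res.show()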
CodePudding user response:
There is a much simpler solution, using stack: stack(n, expr1, ..., exprk) separates expr1 through exprk into n rows, which is exactly the wide-to-long reshape asked for here.

Spark SQL
with t (date, Tokyo, `New York`) as (
    select stack(3, '01/01', 1, 2, '02/01', 3, 2, '03/01', 4, 5)
)
-- The solution starts here
select date, stack(2, 'Tokyo', Tokyo, 'New York', `New York`) as (city, value)
from t
| date | city | value |
|---|---|---|
| 01/01 | Tokyo | 1 |
| 01/01 | New York | 2 |
| 02/01 | Tokyo | 3 |
| 02/01 | New York | 2 |
| 03/01 | Tokyo | 4 |
| 03/01 | New York | 5 |
PySpark
df = spark.sql("select stack(3, '01/01', 1, 2, '02/01', 3, 2, '03/01', 4, 5) as (date, Tokyo, `New York`)")

# The solution starts here
df.selectExpr("date", "stack(2, 'Tokyo', Tokyo, 'New York', `New York`) as (city, value)").show()
+-----+--------+-----+
| date|    city|value|
+-----+--------+-----+
|01/01|   Tokyo|    1|
|01/01|New York|    2|
|02/01|   Tokyo|    3|
|02/01|New York|    2|
|03/01|   Tokyo|    4|
|03/01|New York|    5|
+-----+--------+-----+
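For completeness: on Spark 3.4 or later the DataFrame API also offers unpivot (with melt as an alias), which performs the same reshape without writing the stack expression by hand. A sketch, assuming the wide df from the question:

# Requires Spark 3.4+: keep 'date' as the id column and melt the two city columns
long_df = df.unpivot(
    ids=["date"],
    values=["Tokyo", "New York"],
    variableColumnName="city",
    valueColumnName="value",
)
long_df.show()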