I have a PySpark dataframe as below:
Id | variable | old_val | new_val |
---|---|---|---|
a1 | frequency | 2.0 | 25.0 |
a1 | latitude | 25.762 | 25.729 |
a1 | longitude | -80.192 | -80.436 |
a2 | frequency | 1.0 | 5.0 |
a2 | latitude | 25.7 | 25.762 |
a2 | longitude | -80.436 | -80.192 |
I am trying to pivot the table so that each "Id" gets a single row reflecting its old and new values.
I would like to achieve the below ideal state:
Id | freq_old_val | freq_new_val | lat_old_val | lat_new_val | long_old_val | long_new_val |
---|---|---|---|---|---|---|
a1 | 2.0 | 25.0 | 25.762 | 25.729 | -80.192 | -80.436 |
a2 | 1.0 | 5.0 | 25.7 | 25.762 | -80.436 | -80.192 |
My (so far unsuccessful) attempt
I am unsure if I must use explode. I am also unsure if agg can be passed two column values.
from pyspark.sql import functions as F

df.groupBy("Id").pivot("variable").agg(F.first("old_val"), F.first("new_val"))
I am fairly new to PySpark and working my way through it. Any guidance is highly appreciated. Thank you for taking the time to help.
CodePudding user response:
I think a similar question has already been answered here: How to pivot on multiple columns in Spark SQL?

Please comment if it is not clear. A sketch based on that approach is below.
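In short: after pivot("variable"), pass one aliased aggregation per value column to agg(); Spark then names each pivoted column {pivot_value}_{alias}. Here is a minimal sketch against the sample data from the question. The rename map at the end is only an assumption to match the shorter freq_/lat_/long_ names in the desired output; it is not required by the API.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the question.
df = spark.createDataFrame(
    [
        ("a1", "frequency", 2.0, 25.0),
        ("a1", "latitude", 25.762, 25.729),
        ("a1", "longitude", -80.192, -80.436),
        ("a2", "frequency", 1.0, 5.0),
        ("a2", "latitude", 25.7, 25.762),
        ("a2", "longitude", -80.436, -80.192),
    ],
    ["Id", "variable", "old_val", "new_val"],
)

# Two aggregations after pivot(): Spark suffixes each pivoted
# column with the aggregation alias, giving e.g. frequency_old_val.
pivoted = (
    df.groupBy("Id")
    .pivot("variable")
    .agg(
        F.first("old_val").alias("old_val"),
        F.first("new_val").alias("new_val"),
    )
)

# Hypothetical renames, only to match the column names in the question.
renames = {
    "frequency_old_val": "freq_old_val",
    "frequency_new_val": "freq_new_val",
    "latitude_old_val": "lat_old_val",
    "latitude_new_val": "lat_new_val",
    "longitude_old_val": "long_old_val",
    "longitude_new_val": "long_new_val",
}
for old, new in renames.items():
    pivoted = pivoted.withColumnRenamed(old, new)

pivoted.show()

first() works here because each (Id, variable) pair occurs only once, so there is nothing to actually aggregate; it simply carries the single value through the pivot.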