One question: how can I transpose rows into columns on pyspark? My original dataframe looks like this:
ID | DATE | APP | DOWNLOADS | ACTIVE_USERS
___________________________________________________________
0 | 2021-01-10 | FACEBOOK | 1000 | 5000
1 | 2021-01-10 | INSTAGRAM | 9000 | 90000
2 | 2021-02-10 | FACEBOOK | 9000 | 72000
3 | 2021-02-10 | INSTAGRAM | 16000 | 500000
But I need it like this:
ID | DATE | FACEBOOK - DOWNLOADS | FACEBOOK - ACTIVE_USERS | INSTAGRAM - DOWNLOADS | INSTAGRAM - ACTIVE_USERS
___________________________________________________________________________________________________________________
0 | 2021-01-10 | 1000 | 5000 | 9000 | 90000
1 | 2021-02-10 | 9000 | 72000 | 16000 | 50000
I tried using the answer on this question: Transpose pyspark rows into columns, but i couldn't make it work.
Could you help me please? Thank you!
CodePudding user response:
From your example I assume the "ID" column is not needed to group on, as it looks to be reset in your outcome. That would make the query something like below:
import pyspark.sql.functions as F
df.groupBy("DATE").pivot('APP').agg(
F.first('DOWNLOADS').alias("DOWNLOADS"),
F.first("ACTIVE_USERS").alias("ACTIVE_USERS")
)
We groupby the date and pivot on app and retrieve the first value for downloads and active users.
outcome:
---------- ------------------ --------------------- ------------------- ----------------------
| DATE|FACEBOOK_DOWNLOADS|FACEBOOK_ACTIVE_USERS|INSTAGRAM_DOWNLOADS|INSTAGRAM_ACTIVE_USERS|
---------- ------------------ --------------------- ------------------- ----------------------
|2021-02-10| 9000| 72000| 16000| 500000|
|2021-01-10| 1000| 5000| 9000| 90000|
---------- ------------------ --------------------- ------------------- ----------------------