I have a table:
Department | Name | Start | End |
---|---|---|---|
Finance | John Doe | 01/01/2022 | 01/05/2002 |
Marketing | Mark Smith | 05/02/2022 | 08/03/2002 |
I want to transform the table the same way a CROSS JOIN UNNEST would in SQL.
So the output I'm looking for is:
Department | Name | Event | Date |
---|---|---|---|
Finance | John Doe | Start | 01/01/2022 |
Finance | John Doe | End | 01/05/2002 |
Marketing | Mark Smith | Start | 05/02/2022 |
Marketing | Mark Smith | End | 08/03/2002 |
How can I achieve this with a PySpark dataframe?
CodePudding user response:
Using the example you provided, stack can do it:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('Finance', 'John Doe', '01/01/2022', '01/05/2002'),
('Marketing', 'Mark Smith', '05/02/2022', '08/03/2002')],
['Department', 'Name', 'Start', 'End'])
# stack(2, ...) expands the two (label, column) pairs into two rows per input row
df = df.select('Department', 'Name', F.expr("stack(2, 'Start', Start, 'End', End) as (Event, Date)"))
df.show()
# +----------+----------+-----+----------+
# |Department|      Name|Event|      Date|
# +----------+----------+-----+----------+
# |   Finance|  John Doe|Start|01/01/2022|
# |   Finance|  John Doe|  End|01/05/2002|
# | Marketing|Mark Smith|Start|05/02/2022|
# | Marketing|Mark Smith|  End|08/03/2002|
# +----------+----------+-----+----------+
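For intuition, stack(2, 'Start', Start, 'End', End) emits two output rows per input row: one labelled 'Start' carrying the Start value, and one labelled 'End' carrying the End value. Here is the same expansion sketched in plain Python (no Spark required), just to illustrate the semantics:

```python
# Sample rows matching the question's table.
rows = [
    ('Finance', 'John Doe', '01/01/2022', '01/05/2002'),
    ('Marketing', 'Mark Smith', '05/02/2022', '08/03/2002'),
]

# Each (Department, Name, Start, End) row expands into two
# (Department, Name, Event, Date) rows, mirroring the stack call above.
unpivoted = [
    (dept, name, event, date)
    for dept, name, start, end in rows
    for event, date in (('Start', start), ('End', end))
]

for r in unpivoted:
    print(r)
```

Note also that on Spark 3.4+ the same unpivot is available directly as DataFrame.unpivot (alias melt), without writing a stack expression by hand.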