I have a column of lists in a Spark DataFrame:
+---+------------+
| c1|          c2|
+---+------------+
|  a|[1, 0, 1, 1]|
|  b|[0, 1, 1, 0]|
|  c|[1, 1, 0, 0]|
+---+------------+
How do I convert this into another Spark DataFrame where each list becomes a column, with each entry from column 'c1' as the name of the new column? Something like below:
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  0|  1|
|  0|  1|  1|
|  1|  1|  0|
|  1|  0|  0|
+---+---+---+
Note: I did consider following Convert Column of List to Dataframe and then transposing the resulting matrix. But that creates a very large number of columns (the lists in my data are quite long), so it isn't an efficient solution. A sketch of that rejected first step follows.
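For reference, a minimal sketch of that first list-to-columns step (the column names like col0 and the fixed length n are only illustrative; my real lists are far longer):

import pyspark.sql.functions as F

# Expand each array element into its own column -- one column per index.
# With long lists this yields thousands of columns before the transpose even starts.
n = 4  # illustrative; the real lists are much longer
wide = df.select('c1', *[F.col('c2')[i].alias(f'col{i}') for i in range(n)])
# 'wide' would then still need a full row/column transpose, which Spark
# has no cheap primitive for.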
Any help is welcome.
CodePudding user response:
import pyspark.sql.functions as F

# Not part of the solution -- only used to generate the sample data
df = spark.sql("select stack(3, 'a', array(1, 0, 1, 1), 'b', array(0, 1, 1, 0), 'c', array(1, 1, 0, 0)) as (c1, c2)")

# Pivot the rows into array columns, zip the arrays element-wise, and explode into rows
df.groupBy().pivot('c1').agg(F.first('c2')).selectExpr('inline(arrays_zip(*))').show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  0|  1|
|  0|  1|  1|
|  1|  1|  0|
|  1|  0|  0|
+---+---+---+
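For clarity, here is the same pipeline split into its two steps (a sketch, using the sample df from above):

import pyspark.sql.functions as F

# Step 1: pivot -- one output column per distinct c1 value, each holding its array.
# The result is a single row: a=[1,0,1,1], b=[0,1,1,0], c=[1,1,0,0]
pivoted = df.groupBy().pivot('c1').agg(F.first('c2'))

# Step 2: arrays_zip combines the arrays element-wise into one array of structs,
# and inline explodes that array into rows, one column per struct field.
pivoted.selectExpr('inline(arrays_zip(*))').show()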
This can easily be tested on large datasets:
df = spark.sql("select id as c1, transform(sequence(1,10000), e -> tinyint(round(rand()))) as c2 from range(10000)")
I just completed a successful execution over 10K arrays, 10K elements each, on a VM with 4 cores & 32 GB RAM (Azure Databricks). It took 5.35 minutes.
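If you want to reproduce that timing yourself, something along these lines should work (a sketch; the noop data source requires Spark 3.0+, and the variable names are mine):

import time
import pyspark.sql.functions as F

big = spark.sql("select id as c1, transform(sequence(1,10000), e -> tinyint(round(rand()))) as c2 from range(10000)")

t0 = time.time()
out = big.groupBy().pivot('c1').agg(F.first('c2')).selectExpr('inline(arrays_zip(*))')
out.write.format('noop').mode('overwrite').save()  # forces full execution without collecting to the driver
print(f'{(time.time() - t0) / 60:.2f} minutes')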