Convert Column of List to a Dataframe Column


I have a column of lists in a Spark DataFrame.


+---+------------+
|c1 |c2          |
+---+------------+
|a  |[1, 0, 1, 1]|
|b  |[0, 1, 1, 0]|
|c  |[1, 1, 0, 0]|
+---+------------+

How do I convert this into another Spark DataFrame in which each list becomes a column, with each entry from column 'c1' as the name of the new column? Something like below.

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  0|  1|
|  0|  1|  1|
|  1|  1|  0|
|  1|  0|  0|
+---+---+---+

Note: I did consider following Convert Column of List to Dataframe and then taking a transpose of the resulting matrix. But that approach first creates a very large number of columns (my lists are quite long), so it isn't an efficient solution.

Any help is welcome.
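In plain Python terms (illustration only, with the small sample above), the desired transformation is a transpose: each 'c2' list becomes a column named by its 'c1' value, and output rows are read across the lists.

```python
# Each key is a c1 value, each list the corresponding c2 array.
data = {"a": [1, 0, 1, 1], "b": [0, 1, 1, 0], "c": [1, 1, 0, 0]}

names = list(data)                # column names: ['a', 'b', 'c']
rows = list(zip(*data.values()))  # transpose: one tuple per output row

print(names)  # ['a', 'b', 'c']
print(rows)   # [(1, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 0)]
```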

CodePudding user response:

import pyspark.sql.functions as F

# Not part of the solution, only used to generate the data sample
df = spark.sql("select stack(3 ,'a',array(1, 0, 1, 1), 'b',array(0, 1, 1, 0) ,'c',array(1, 1, 0, 0)) as (c1,c2)")

# pivot: collapse to a single row with one array column per c1 value;
# arrays_zip: zip those arrays element-wise into an array of structs;
# inline: explode the array of structs into rows, one column per struct field
df.groupBy().pivot('c1').agg(F.first('c2')).selectExpr('inline(arrays_zip(*))').show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  0|  1|
|  0|  1|  1|
|  1|  1|  0|
|  1|  0|  0|
+---+---+---+
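To see why this works, here is the same pipeline mimicked step by step in plain Python (illustration only; the real work is done by Spark's `arrays_zip` and `inline`):

```python
# After groupBy().pivot('c1').agg(F.first('c2')) there is a single row
# holding one array column per c1 value:
pivoted = {"a": [1, 0, 1, 1], "b": [0, 1, 1, 0], "c": [1, 1, 0, 0]}

# arrays_zip: zip the arrays element-wise into an array of structs
zipped = [dict(zip(pivoted.keys(), vals)) for vals in zip(*pivoted.values())]

# inline: explode the array of structs into rows, one column per field
for row in zipped:
    print(row)
# {'a': 1, 'b': 0, 'c': 1}
# {'a': 0, 'b': 1, 'c': 1}
# {'a': 1, 'b': 1, 'c': 0}
# {'a': 1, 'b': 0, 'c': 0}
```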

This can easily be tested on large datasets:

df = spark.sql("select id as c1, transform(sequence(1,10000), e -> tinyint(round(rand()))) as c2 from range(10000)")
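For reference, a plain-Python equivalent of this generator (illustration only, with sizes reduced; the function name and seed are mine): each row gets an id and a list of random 0/1 values, just like `transform(sequence(...), e -> tinyint(round(rand())))` above.

```python
import random

def make_sample(n_rows, n_elems, seed=0):
    """Build n_rows (id, values) pairs, each with n_elems random 0/1 entries."""
    rng = random.Random(seed)
    return [(i, [round(rng.random()) for _ in range(n_elems)])
            for i in range(n_rows)]

sample = make_sample(5, 8)
print(sample[0])
```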

I just completed a successful execution on 10K arrays of 10K elements each, on a VM with 4 cores and 32 GB RAM (Azure Databricks).
It took 5.35 minutes.
