Spark: Transform array to Column with size of Array using Map iterable


I have the following data:

df.show

+-------+---+-----------+-----------+
|   name|age|     tokens|tokensCount|
+-------+---+-----------+-----------+
|  Alice| 29|    [A,B,C]|          3|
|    Bob| 28|  [A,B,C,D]|          4|
|Charlie| 29|[A,B,C,D,E]|          5|
+-------+---+-----------+-----------+

I transform the data with the following command:

val newDF = df.select(($"name") +: (0 until 4).map(i => $"tokens"(i).alias(s"token$i")): _*)
newDF.show

+-------+------+------+------+------+
|   name|token0|token1|token2|token3|
+-------+------+------+------+------+
|  Alice|     A|     B|     C|  null|
|    Bob|     A|     B|     C|     D|
|Charlie|     A|     B|     C|     D|
+-------+------+------+------+------+

I want to use tokensCount instead of the static value 4 in (0 until 4).

I tried a few things like $"tokensCount" and size($"tokens"), but could not get them to work.

Can anyone suggest how to loop or map according to the size (count) of the array?

Many thanks

CodePudding user response:

You can modify your code to first find the maximum length of the tokens arrays, and then use that to create the necessary columns:

import org.apache.spark.sql.functions.{col, max, size}
import spark.implicits._

val df = Seq(
  ("Alice", 29, Array("A", "B", "C")),
  ("Bob", 28, Array("A", "B", "C", "D")),
  ("Charlie", 29, Array("A", "B", "C", "D", "E"))
).toDF("name", "age", "tokens")

// Compute the maximum token count across all rows
val maxTokenCount = df.withColumn("token_count", size(col("tokens"))).agg(max("token_count") as "mtc")

// Prepend the name column to one generated column per token index
val newDF = df.select(($"name") +: (0 until maxTokenCount.first.getInt(0)).map(i => $"tokens"(i).alias(s"token$i")): _*)
newDF.show
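As a minor variant (not part of the original answer), the two steps can be collapsed so the aggregation returns a plain Int directly:

```scala
import org.apache.spark.sql.functions.{max, size}

// Aggregate max(size(tokens)) in one pass and extract the Int
val maxLen = df.agg(max(size($"tokens"))).first.getInt(0)

val wideDF = df.select(($"name") +: (0 until maxLen).map(i => $"tokens"(i).alias(s"token$i")): _*)
wideDF.show
```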

Which will give you:

+-------+------+------+------+------+------+
|   name|token0|token1|token2|token3|token4|
+-------+------+------+------+------+------+
|  Alice|     A|     B|     C|  null|  null|
|    Bob|     A|     B|     C|     D|  null|
|Charlie|     A|     B|     C|     D|     E|
+-------+------+------+------+------+------+

It might be useful to explain why you want to do this transformation, as there might be a much more efficient way. This has the potential to create a very sparse dataframe. Imagine that most names have no tokens, but Bob has 100 tokens: all of a sudden you have one hundred columns of mostly null values.
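If the downstream processing is per-token, a long format sidesteps the sparse-column problem entirely. A sketch (assuming the same df as above) using posexplode, which also yields each token's position in the array:

```scala
import org.apache.spark.sql.functions.posexplode

// One row per (name, token) pair; pos records the token's index in the array.
// Rows are only created for tokens that exist, so no nulls appear.
val longDF = df.select($"name", posexplode($"tokens").as(Seq("pos", "token")))
longDF.show
```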
