I have following data
df.show
--------- -------------------- ------------
| name| age| tokens| tokensCount|
--------- ---- --------------- ------------
| Alice| 29| [A,B,C]| 3|
| Bob| 28| [A,B,C,D]| 4|
| Charlie| 29| [A,B,C,D,E]| 5|
--------- ---- --------------- ------------
I transform data with following command
val newDF = df.select(($"name") : (0 until 4).map(i => ($"tokens")(i).alias(s"token$i")): _*).show
--------- ------- ------- ------- -------
| name| token0| token1| token2| token3|
--------- ------- ------- ------- -------
| Alice| A| B| C| null|
| Bob| A| B| C| D|
| Charlie| A| B| C| D|
--------- ------- ------- ------- -------
I want to give tokensCount
instead of static value 4
at (0 until 4)
I tried a few things like $"tokensCount"
and size($"tokens")
, but could not get through.
Can anyone suggest how to loop or map according to the size of array or count of array ?
Many thanks
CodePudding user response:
You can modify your code to find the maximum length of tokens
, and then use that to create the necessary columns:
val df = Seq(
("Alice", 29, Array("A", "B", "C")),
("Bob", 28, Array("A", "B", "C", "D")),
("Charlie", 29, Array("A", "B", "C", "D", "E")),
).toDF("name", "age", "tokens")
val maxTokenCount = df.withColumn("token_count", size(col("tokens"))).agg(max("token_count") as "mtc")
val newDF = df.select(($"name") : (0 until maxTokenCount.first.getInt(0)).map(i => ($"tokens")(i).alias(s"token$i")): _*).show
Which will give you:
------- ------ ------ ------ ------ ------
| name|token0|token1|token2|token3|token4|
------- ------ ------ ------ ------ ------
| Alice| A| B| C| null| null|
| Bob| A| B| C| D| null|
|Charlie| A| B| C| D| E|
------- ------ ------ ------ ------ ------
It might be useful to explain why you want to do this transformation, as there might be a much more efficient way. This has the potential to create a very sparse dataframe. Imagine that most names have no tokens, but Bob has 100 tokens: all of a sudden you have one hundred columns of mostly null values.