Home > Mobile >  Split row into multiple rows to limit length of array in column (spark / scala)
Split row into multiple rows to limit length of array in column (spark / scala)

Time:02-11

I have a dataframe that looks like this:

 -------------- -------------------- 
|id            |           items    |
 -------------- -------------------- 
|             1|[a, b, .... x, y, z]|
 -------------- -------------------- 
|             1|[q, z, .... x, b, 5]|
 -------------- -------------------- 
|             2|[q, z, .... x, b, 5]|
 -------------- -------------------- 

I want to split the rows so that the array in the items column is at most length 20. If an array has length greater than 20, I would want to make new rows and split the array up so that each array is of length 20 or less. So for the first row in my example dataframe, if we assume the length is 10 and I want at most length 3 for each row, I would like for it to be split like this:

 -------------- -------------------- 
|id            |           items    |
 -------------- -------------------- 
|             1|[a, b, c]           |
 -------------- -------------------- 
|             1|[z, y, z]           |
 -------------- -------------------- 
|             1|[e, f, g]           |
 -------------- -------------------- 
|             1|[q]                 |
 -------------- -------------------- 

Ideally, all rows should be of length 3 except the last row if the length of the array is not evenly divisible by the max desired length. Note - the id column is not unique

CodePudding user response:

You could try this:

import pandas as pd

max_item_length = 3

df = pd.DataFrame(
    {"fake_index": [1, 2, 3],
     "items": [["a", "b", "c", "d", "e"], ["f", "g", "h", "i", "j"], ["k", "l"]]}
)

df2 = pd.DataFrame({"fake_index": [], "items": []})

for i in df.index:
    try:
        df2 = df2.append({"fake_index": int(df.iloc[i, 0]), "items": df.iloc[i, 1][:max_item_length]},
                         ignore_index=True)
        df2 = df2.append({"fake_index": int(df.iloc[i, 0]), "items": df.iloc[i, 1][max_item_length:]},
                         ignore_index=True)
    except:
        df2 = df2.append({"fake_index": int(df.iloc[i, 0]), "items": df.iloc[i, 1]}, ignore_index=True)

df = df2

print(df)

Input:

   fake_index            items
0           1  [a, b, c, d, e]
1           2  [f, g, h, i, j]
2           3           [k, l]

Output:

   fake_index      items
0           1  [a, b, c]
1           1     [d, e]
2           2  [f, g, h]
3           2     [i, j]
4           3     [k, l]

CodePudding user response:

Using higher-order functions transform filter along with slice, you can split the array into sub arrays of size 20 then explode it:

val l = 20

val df1 = df.withColumn(
  "items",
  explode(
    expr(
      s"filter(transform(items, (x,i)-> IF(i%$l=0, slice(items,i 1,$l), null)), x-> x is not null)"
    )
  )
)
  • Related