Dynamically create columns from string with delimiter in Spark


I have a table like this:

a |  a_vector                               |  
1 | 710.83;-776.98;-10.86;2013.02;-896.28;  |  
2 | 3  ; 2  ;  1                            | 

Using PySpark/pandas, how do I dynamically create columns so that the first value in the vector goes to "col1", the second value goes to "col2", etc., and then calculate the sum?

a |  a_vector                               |  col1 | col2   | col3
1 | 300;-200;2022;                          |  300  |  -200  | 2022
2 | 3  ; 2  ;  1                            |   3   |  2     | 1

CodePudding user response:

With column names:

import pandas as pd

# Split on ';' into separate columns and rename them col1, col2, ...
dfs = df['a_vector'].str.split(';', expand=True).rename(columns=lambda x: "col" + str(x + 1))
df = pd.concat([df, dfs], axis=1)
print(df)

Input

   a       a_vector
0  1  300;-200;2022
1  2          3;2;1

Output

   a       a_vector col1  col2  col3
0  1  300;-200;2022  300  -200  2022
1  2          3;2;1    3     2     1

Note: Not tested. Let me know if it doesn't work.
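
The question also asks to calculate the sum; a minimal follow-up sketch, assuming the split columns should be coerced to numeric with pd.to_numeric (not part of the answer above):

# Hypothetical extra step: coerce the new columns to numbers and add a per-row sum
num_cols = [c for c in df.columns if c.startswith('col')]
df['sum'] = df[num_cols].apply(pd.to_numeric, errors='coerce').sum(axis=1)
print(df)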

CodePudding user response:

In PySpark, you can do it by first splitting your string on ; (creating an array) and then selecting columns with a list comprehension. The sum can be calculated using the aggregate higher-order function.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('1', '300;-200;2022'),
     ('2', '3  ; 2  ;   1')],
    ['a', 'a_vector']
)

Script:

# Split on ';' while trimming surrounding whitespace
col_arr = F.split('a_vector', r'\s*;\s*')
max_size = df.agg(F.max(F.size(col_arr))).head()[0]
df = df.select(
    '*',
    *[col_arr[i].alias(f'col{i}') for i in range(max_size)],
    F.aggregate(col_arr, F.lit(0.0), lambda acc, x: acc + x).alias('sum')
)

df.show()
# +---+-------------+----+----+----+------+
# |  a|     a_vector|col0|col1|col2|   sum|
# +---+-------------+----+----+----+------+
# |  1|300;-200;2022| 300|-200|2022|2122.0|
# |  2|3  ; 2  ;   1|   3|   2|   1|   6.0|
# +---+-------------+----+----+----+------+
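
The a_vector in the original question also ends with a trailing ';', which leaves an empty string in the split array; a hedged variant, assuming Spark 3.1+ so F.filter is available, that drops empty tokens before summing (a sketch, not the answer's exact code):

# Keep only non-empty tokens, then aggregate with an explicit cast to double
tokens = F.filter(F.split('a_vector', r'\s*;\s*'), lambda x: x != '')
df = df.withColumn(
    'sum',
    F.aggregate(tokens, F.lit(0.0), lambda acc, x: acc + x.cast('double'))
)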