I have a table like this:
a | a_vector |
1 | 710.83;-776.98;-10.86;2013.02;-896.28; |
2 | 3 ; 2 ; 1 |
Using PySpark/pandas, how do I dynamically create columns so that the first value in the vector goes to "col1", the second to "col2", and so on, and also calculate the sum of the values?
a | a_vector | col1 | col2 | col3
1 | 300;-200;2022; | 300 | -200 | 2022
2 | 3 ; 2 ; 1 | 3 | 2 | 1
CodePudding user response:
With column names:
import pandas as pd
dfs = df['a_vector'].str.split(';', expand=True).rename(columns=lambda x: "col" + str(x + 1))
df = pd.concat([df, dfs], axis=1)
print(df)
inputs
a a_vector
0 1 300;-200;2022
1 2 3;2;1
Output
a a_vector col1 col2 col3
0 1 300;-200;2022 300 -200 2022
1 2 3;2;1 3 2 1
Note: Not tested. Let me know if it doesn't work.
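The question also asks for the sum of the values. A minimal sketch of adding that on top of the split (my own addition, assuming the values parse as numbers and the column name 'sum' is free):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'a_vector': ['300;-200;2022', '3 ; 2 ; 1']})
# split into dynamic columns, as above
dfs = df['a_vector'].str.split(';', expand=True).rename(columns=lambda x: "col" + str(x + 1))
# coerce to numbers (handles stray whitespace like '3 ; 2 ; 1'), then sum row-wise
dfs = dfs.apply(pd.to_numeric, errors='coerce')
df = pd.concat([df, dfs], axis=1)
df['sum'] = dfs.sum(axis=1)
print(df)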
CodePudding user response:
In PySpark, you can do it by first splitting your string on ; (creating an array) and then selecting columns using a list comprehension. The sum can be calculated using the aggregate higher-order function.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('1', '300;-200;2022'),
('2', '3 ; 2 ; 1')],
['a', 'a_vector']
)
Script:
col_arr = F.split('a_vector', r'\s*;\s*')
max_size = df.agg(F.max(F.size(col_arr))).head()[0]
df = df.select(
    '*',
    *[col_arr[i].alias(f'col{i}') for i in range(max_size)],
    F.aggregate(col_arr, F.lit(0.0), lambda acc, x: acc + x).alias('sum')
)
df.show()
# +---+-------------+----+----+----+------+
# |  a|     a_vector|col0|col1|col2|   sum|
# +---+-------------+----+----+----+------+
# |  1|300;-200;2022| 300|-200|2022|2122.0|
# |  2|    3 ; 2 ; 1|   3|   2|   1|   6.0|
# +---+-------------+----+----+----+------+
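The lambda above relies on Spark implicitly casting the string array elements to double when adding them to the 0.0 accumulator. If you prefer to make that cast explicit (my variation, not part of the original snippet), you can cast each element inside the merge function:
F.aggregate(col_arr, F.lit(0.0), lambda acc, x: acc + x.cast('double')).alias('sum')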