I have a Spark dataframe:
> numbers_df
+---+---------+--------+--------+------------------------------------+
| id|    num_1|   num_2|   num_3|                             all_num|
+---+---------+--------+--------+------------------------------------+
|  1|[1, 2, 5]|  [4, 7]|  [8, 3]|         [1, 2, 3, 4, 5, 6, 7, 8, 9]|
|  2| [12, 13]|[10, 16]|[15, 17]|[10, 11, 12, 13, 14, 15, 16, 17, 18]|
+---+---------+--------+--------+------------------------------------+
I need to remove the values of the num_1, num_2 and num_3 columns from the all_num column.
Expected result:
| id | num_1     | num_2    | num_3    | all_num                              | except_num   |
|----|-----------|----------|----------|--------------------------------------|--------------|
| 1  | [1, 2, 5] | [4, 7]   | [8, 3]   | [1, 2, 3, 4, 5, 6, 7, 8, 9]          | [6, 9]       |
| 2  | [12, 13]  | [10, 16] | [15, 17] | [10, 11, 12, 13, 14, 15, 16, 17, 18] | [11, 14, 18] |
How can this be done in PySpark, given that the array_except function takes only two columns as input?
CodePudding user response:
You can combine the array_except and concat functions: concat merges the three array columns into a single array, and array_except then removes those values from all_num.
from pyspark.sql import functions as F

# Subtract the concatenated num_1/num_2/num_3 arrays from all_num
df = df.withColumn('except_num', F.array_except('all_num', F.concat('num_1', 'num_2', 'num_3')))
df.show(truncate=False)
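For reference, a minimal self-contained sketch of the same idea (assuming Spark 2.4+, where concat accepts array columns and array_except is available; the session setup and sample data are illustrative, not from the original post):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, [1, 2, 5], [4, 7], [8, 3], [1, 2, 3, 4, 5, 6, 7, 8, 9]),
        (2, [12, 13], [10, 16], [15, 17], [10, 11, 12, 13, 14, 15, 16, 17, 18]),
    ],
    ['id', 'num_1', 'num_2', 'num_3', 'all_num'],
)

# concat on array columns joins the three arrays into one combined array;
# array_except keeps the elements of all_num that do not appear in that array
result = df.withColumn(
    'except_num',
    F.array_except('all_num', F.concat('num_1', 'num_2', 'num_3')),
)
result.show(truncate=False)

One caveat to keep in mind: concat returns NULL if any of its input arrays is NULL, so if the num_* columns can be NULL you may want to wrap each one in something like F.coalesce(F.col('num_1'), F.array()) before concatenating.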