I have a Spark DataFrame as follows:
target_id other_ids
3733345 [3731634, 3729995, 3728014, 3708332, 3720...
3725312 [3711541, 3726052, 3733763, 900056057, 371...
3717114 [3701718, 3713481, 3715433, 3714825, 3731...
3408996 [3405896, 3250400, 3237054, 3242492, 3256...
3354970 [3354969, 3347893, 3348168, 3353273, 3356...
I want to first shuffle the elements in the arrays of the other_ids column, and then create a new column new_id where I sample an id from the array in the other_ids column, where target_id is not in other_ids.
Final result:
target_id other_ids new_id
3733345 [3731634, 3729995, 3728014, 3708332, 3720... 3708332
3725312 [3711541, 3726052, 3733763, 900056057, 371... 900056057
3717114 [3701718, 3713481, 3715433, 3714825, 3731... 3250400
3408996 [3405896, 3250400, 3237054, 3242492, 3256... 3237054
3354970 [3354969, 3347893, 3348168, 3353273, 3356... 3353273
Any suggestions? Thanks.
CodePudding user response:
You can try this -
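from pyspark.sql import functions as F

# Drop target_id from other_ids, shuffle what is left, and take the first element.
# element_at is 1-based, so index 1 is the first element of the shuffled array.
df = df.withColumn('new_id', F.element_at(
    F.shuffle(
        F.array_except(F.col('other_ids'), F.array(F.col('target_id')))
    ),
    1
))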
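
If it helps to see what that expression does to a single row, here is a rough plain-Python sketch of the same three steps (exclude, shuffle, take the first). It does not use Spark at all; the function name sample_new_id and the sample ids are made up for illustration, and returning None for an empty array is an assumption that matches element_at only when ANSI mode is off.

```python
import random

def sample_new_id(target_id, other_ids, rng=random):
    """Row-level sketch of element_at(shuffle(array_except(...)), 1)."""
    # array_except also removes duplicates, so dedupe while keeping order,
    # then drop target_id from the candidates.
    candidates = [i for i in dict.fromkeys(other_ids) if i != target_id]
    if not candidates:
        # Assumption: nothing left to sample from, so return None.
        return None
    shuffled = list(candidates)
    rng.shuffle(shuffled)   # in-place shuffle of the copy
    return shuffled[0]      # first element, like element_at(..., 1)

# Hypothetical rows: (target_id, other_ids)
rows = [
    (10, [11, 12, 10, 13]),   # target_id appears in other_ids and is excluded
    (20, [21, 22, 23]),
    (30, [30]),               # only target_id present, so new_id is None
]

for target_id, other_ids in rows:
    print(target_id, other_ids, sample_new_id(target_id, other_ids))
```

Note that neither this sketch nor the Spark expression keeps the shuffled array itself; if you also need other_ids shuffled in the output, that would be a separate withColumn call with F.shuffle.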