Pyspark sample value from array column


I have my Spark dataframe as follows:

target_id   other_ids
3733345     [3731634, 3729995, 3728014, 3708332, 3720...
3725312     [3711541, 3726052, 3733763, 900056057, 371...
3717114     [3701718, 3713481, 3715433, 3714825, 3731...
3408996     [3405896, 3250400, 3237054, 3242492, 3256...
3354970     [3354969, 3347893, 3348168, 3353273, 3356...

I want to first shuffle the elements of the arrays in the other_ids column and then create a new column new_id by sampling an id from other_ids, excluding the row's target_id from the candidates.
Final result:

target_id   other_ids                                      new_id
3733345     [3731634, 3729995, 3728014, 3708332, 3720...   3708332
3725312     [3711541, 3726052, 3733763, 900056057, 371...  900056057
3717114     [3701718, 3713481, 3715433, 3714825, 3731...   3250400
3408996     [3405896, 3250400, 3237054, 3242492, 3256...   3237054
3354970     [3354969, 3347893, 3348168, 3353273, 3356...   3353273

Any suggestions? Thanks.

CodePudding user response:

You can try this -

from pyspark.sql import functions as F

df = df.withColumn(
    'new_id',
    F.element_at(
        # drop target_id from other_ids, shuffle the remaining ids,
        # then take the first element as the sampled id
        F.shuffle(
            F.array_except(F.col('other_ids'), F.array(F.col('target_id')))
        ),
        1
    )
)
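For reference, a minimal end-to-end sketch of the same approach, assuming Spark 2.4+ (shuffle and array_except were added in 2.4); the session name and toy data below are just placeholders, and the result is non-deterministic because shuffle randomizes the array order:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sample-from-array").getOrCreate()

# toy data mimicking the structure of the question
df = spark.createDataFrame(
    [
        (3733345, [3731634, 3729995, 3728014, 3708332]),
        (3725312, [3711541, 3726052, 3733763, 900056057]),
    ],
    ["target_id", "other_ids"],
)

df = df.withColumn(
    "new_id",
    F.element_at(
        F.shuffle(F.array_except(F.col("other_ids"), F.array(F.col("target_id")))),
        1,
    ),
)

df.show(truncate=False)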