Optimize the code when creating new columns based on the number of token of the row


Is there any way to optimize the following code?

from pyspark.sql import functions as F

c1_arr = F.col('col1')
c2_arr = F.split('col2', ' ')
arr_of_struct = F.transform(
    c1_arr,
    lambda x: F.struct(
        F.size(F.array_intersect(c2_arr, F.split(x, ' '))).alias('cnt'),
        x.alias('val'),
    )
)
top_val = F.sort_array(arr_of_struct, False)[0]


df = df.withColumn('match', F.when((top_val['cnt'] > 1) & (F.size(F.split(F.col('col2'), ' ')) > 1), top_val['val']))

df = df.withColumn('match_one_token', F.when((top_val['cnt'] > 0) & (F.size(F.split(F.col('col2'), ' ')) == 1), top_val['val']))

The arr_of_struct treats each row of col1 as a comma-separated list of phrases and maps each phrase to a struct of (cnt, val), where cnt is the number of tokens the phrase shares with col2.
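
A quick way to inspect that intermediate (a sketch, not from the original post; it assumes col1 has already been split on commas, as the answer below does, since F.transform needs an array column):

df.select(F.sort_array(arr_of_struct, False).alias('structs')).show(truncate=False)
# For col1 = 'summer hot, ..., let it go' and col2 = 'let go hot' this shows
# something like [{2, let it go}, {1, summer hot}, {0, not possible}, ...],
# so top_val ends up as the struct with the highest cnt.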

1st part

The following code selects the element of col1 that shares the most tokens with col2 (the intersection is larger than 1 and col2 has more than one token).

df = df.withColumn('match', F.when((top_val['cnt'] > 1) & (F.size(F.split(F.col('col2'), ' ')) > 1), top_val['val']))

2nd part

The following code selects the element of col1 that shares a token with col2 (the intersection is at least 1 and col2 consists of a single token).

df = df.withColumn('match_one_token', F.when((top_val['cnt'] > 0) & (F.size(F.split(F.col('col2'), ' ')) == 1), top_val['val']))

Example:

key    col1                                                         col2
ab     'summer hot, best friend, not possible, apple, let it go'   "let be hot"
cd     'do it better, I am sa'                                      "I need it do sa"
fg     'my best post, fun sunday'                                   "it's great"
gh     'my best, lets go'                                           "best"

Output:

key    col1                                                         col2               match        match_one_token
ab     'summer hot, best friend, not possible, apple, let it go'   "let be hot"       "let it go"   NA
cd     'do it better, I am sa'                                      "I need it do sa"  "I am sa"     NA
fg     'my best post, fun sunday'                                   "it's great"       NA            NA
gh     'my best, lets go'                                           "best"             NA            "best"

The first row has a common token with both 'summer hot' and 'let it go', but it doesn't matter which one is selected (on a cnt tie, sort_array falls back to comparing val, so the result is still deterministic).

The problem is that the 1st part is fast enough, but when I add the 2nd part it takes a lot of time.

CodePudding user response:

You could try the following. I've changed a few things in several parts of the code, so I hope this works.

  1. A minor thing which you may not even need: knowing the history of this question, I've changed c1_arr and c2_arr (now the code in this question can at least run).

    c1_arr = F.split('col1', r',\s*')
    c2_arr = F.split('col2', r'\s+')
    
  2. The main change, which may boost the performance slightly, is removing the chained withColumn calls and using a single select; every withColumn call adds another projection to the query plan. (A further variant is sketched right after this list.)
    So, instead of

    df = df.withColumn('match', F.when((top_val['cnt'] > 1) & (F.size(F.split(F.col('col2'), ' ')) > 1), top_val['val']))
    df = df.withColumn('match_one_token', F.when((top_val['cnt'] > 0) & (F.size(F.split(F.col('col2'), ' ')) == 1), top_val['val']))
    

    you could use

    df = df.select(
        '*',
        F.when((top_val['cnt'] > 1) & (F.size(c2_arr) > 1), top_val['val']).alias('match'),
        F.when((top_val['cnt'] > 0) & (F.size(c2_arr) == 1), top_val['val']).alias('match_one_token')
    )
    
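A further variant (only a sketch, not part of the original fix; whether it helps depends on how well Spark deduplicates the repeated top_val subexpression) is to materialize top_val once in an intermediate projection. The helper column names _top and _n2 below are made up for illustration:

    # Compute the winning struct and the col2 token count once,
    # then derive both output columns from the materialized values.
    df2 = df.select('*', top_val.alias('_top'), F.size(c2_arr).alias('_n2'))
    df2 = df2.select(
        'key', 'col1', 'col2',
        F.when((F.col('_top')['cnt'] > 1) & (F.col('_n2') > 1), F.col('_top')['val']).alias('match'),
        F.when((F.col('_top')['cnt'] > 0) & (F.col('_n2') == 1), F.col('_top')['val']).alias('match_one_token'),
    )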

Full code

  • Modified input dataframe:

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [('ab', 'summer hot, best friend, not possible, apple, let it go', "let go hot"),
         ('cd', 'do it better, I am sa', "I need to do sa"),
         ('fg', 'my best post, fun sunday', "it's great"),
         ('gh', 'my best, lets go', "best")],
        ['key', 'col1', 'col2'])
    df.show(truncate=0)
    # +---+-------------------------------------------------------+---------------+
    # |key|col1                                                   |col2           |
    # +---+-------------------------------------------------------+---------------+
    # |ab |summer hot, best friend, not possible, apple, let it go|let go hot     |
    # |cd |do it better, I am sa                                  |I need to do sa|
    # |fg |my best post, fun sunday                               |it's great     |
    # |gh |my best, lets go                                       |best           |
    # +---+-------------------------------------------------------+---------------+
    
  • Script:

    c1_arr = F.split('col1', r',\s*')
    c2_arr = F.split('col2', r'\s+')
    arr_of_struct = F.transform(
        c1_arr,
        lambda x: F.struct(
            F.size(F.array_intersect(c2_arr, F.split(x, ' '))).alias('cnt'),
            x.alias('val'),
        )
    )
    top_val = F.sort_array(arr_of_struct, False)[0]
    
    df = df.select(
        '*',
        F.when((top_val['cnt'] > 1) & (F.size(c2_arr) > 1), top_val['val']).alias('match'),
        F.when((top_val['cnt'] > 0) & (F.size(c2_arr) == 1), top_val['val']).alias('match_one_token')
    )
    
    df.show(truncate=0)
    # +---+-------------------------------------------------------+---------------+---------+---------------+
    # |key|col1                                                   |col2           |match    |match_one_token|
    # +---+-------------------------------------------------------+---------------+---------+---------------+
    # |ab |summer hot, best friend, not possible, apple, let it go|let go hot     |let it go|null           |
    # |cd |do it better, I am sa                                  |I need to do sa|I am sa  |null           |
    # |fg |my best post, fun sunday                               |it's great     |null     |null           |
    # |gh |my best, lets go                                       |best           |null     |my best        |
    # +---+-------------------------------------------------------+---------------+---------+---------------+
    
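To check whether the rewrite actually changed the plan, you can compare the physical plans of the withColumn version and the select version (df.explain is standard PySpark; mode='formatted' needs Spark 3.0+):

    # Depending on the optimizer, the select version may show fewer
    # Project nodes than the chained-withColumn version.
    df.explain(mode='formatted')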