How I can modify the following code:
c1_arr = F.col('col1')
c2_arr = F.split(F.trim('col2'), '\s ')
arr_of_struct = F.transform(
c1_arr,
lambda x: F.struct(
F.size(F.array_intersect(c2_arr, F.split(F.trim(x), '\s '))).alias('cnt'),
x.alias('val'),
)
)
top_val = F.sort_array(arr_of_struct, False)[0]
top_val
gives me an element of col1
where it has the most common token with col2
's element. I want to get an element of col1
where it has the most common token with col2
's element and it has the shortest length or token.
For example consider the following data:
col1 col2
["come and get", "computer", come and get more" ] "come for good"
["summer is hot", "summer is too hot", "hot weather"] "hot tea"
["summer is hot", "summer is too hot", "hot weather"] "hot summer"
Desired output:
col1 col2 match
["come and get", "computer", come and get more" ] "come for good" "come and get"
["summer is hot", "summer is too hot", "hot weather"] "hot tea" "hot weather"
["summer is hot", "summer is too hot", "hot weather"] "hot summer" "summer is hot"
Then I use the following code to get my desired result:
df = df.select(
'*',
F.when(((top_val['cnt'] > 1) & (F.size(c2_arr) > 1))|((top_val['cnt'] > 0) & (F.size(c2_arr) == 1))|((F.size(F.split(F.trim(top_val['val']),'\s '))==1) &(top_val['cnt']>0 )),top_val['val']).alias('match'))
The first condition says that the match will be selected if the intersection is more than 1 and the size of co2 is also larger than 1.
the second condition says that if the size of col2
is equal 1 then intersection needs to be non-zero
And the third condition says if the intersection is 1 and an element of col1 is also equal to one, the element is selected.
CodePudding user response:
First of all, input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(["come and get", "computer", "come and get more"], "come for good"),
(["summer is hot", "summer is too hot", "hot weather"], "hot tea"),
(["summer is hot", "summer is too hot", "hot weather"], "hot summer")],
['col1', 'col2'])
For the answer, along the process, I've made 2 options...
You'll want to use this option, because it is way cleaner. In order to make both struct fields sort ascending, I've added the -
sign before the count of matching tokens. When both field order direction is the same, the code does not need much alterations:
c1_arr = F.col('col1')
c2_arr = F.split(F.trim('col2'), '\s ')
arr_of_struct = F.transform(
c1_arr,
lambda x: F.struct(
F.size(F.array_intersect(c2_arr, F.split(F.trim(x), '\s '))).alias('cnt'),
(-F.size(F.split(F.trim(x), '\s '))).alias('len'),
x.alias('val'),
)
)
top_val = F.sort_array(arr_of_struct, False)[0]
df = df.withColumn('match', top_val.val)
df.show(truncate=0)
# ----------------------------------------------- ------------- -------------
# |col1 |col2 |match |
# ----------------------------------------------- ------------- -------------
# |[come and get, computer, come and get more] |come for good|come and get |
# |[summer is hot, summer is too hot, hot weather]|hot tea |hot weather |
# |[summer is hot, summer is too hot, hot weather]|hot summer |summer is hot|
# ----------------------------------------------- ------------- -------------
Option #2 - for academic use ;)
Make results sorted on:
- one struct's field - descending
- other struct's field - ascending
Since order directions are different, I've created the custom sorting function (so-called comparator function). It looks at "cnt" field of the struct (and sorts descending). But if "cnt" field values are equal, then the function looks at "len" field (and sorts ascending).
c1_arr = F.col('col1')
c2_arr = F.split(F.trim('col2'), '\s ')
df = df.withColumn(
'match',
F.transform(
c1_arr,
lambda x: F.struct(
F.size(F.array_intersect(c2_arr, F.split(F.trim(x), '\s '))).alias('cnt'),
F.size(F.split(F.trim(x), '\s ')).alias('len'),
x.alias('val'),
)
)
)
df = df.withColumn('match', F.expr("""
array_sort(
match,
(l, r) -> case when l.cnt > r.cnt then -1
when l.cnt < r.cnt then 1
when l.len < r.len then -1
when l.len > r.len then 1
else 0
end)
""")[0].val)
df.show(truncate=0)
# ----------------------------------------------- ------------- -------------
# |col1 |col2 |match |
# ----------------------------------------------- ------------- -------------
# |[come and get, computer, come and get more] |come for good|come and get |
# |[summer is hot, summer is too hot, hot weather]|hot tea |hot weather |
# |[summer is hot, summer is too hot, hot weather]|hot summer |summer is hot|
# ----------------------------------------------- ------------- -------------
We can only do this in SQL API, because array_sort
in Python API does not have the function parameter.