How to check if at least one element of a list is included in a list column?-CodePudding

Some mock data:

test_data = [('1', '[tech, fx]'),
             ('2', '[industry, computer]'),
             ('3', '[5G, Apple]')]
test_data = spark.sparkContext.parallelize(test_data).toDF(['id', 'text'])

I wan't to create a new column that indicates if the words ['fx', 'computer'] can be found in the column called 'text':

Desired output:

result = [('1', '[tech, fx]', '1'),
          ('2', '[industry, computer]', '1'),
          ('3', '[5G, Apple]', '0')]
result = spark.sparkContext.parallelize(result).toDF(['id', 'text', 'indicator'])

CodePudding user response：

You can do it using higher-order function exists

from pyspark.sql import functions as F

arr_col = F.split(F.expr("TRIM(BOTH '[]' FROM text)"), ", ?")
bln_col = F.exists(arr_col, lambda x: x.isin(['fx', 'computer']))
result = test_data.withColumn('indicator', bln_col.cast('int'))

result.show()
#  --- -------------------- --------- 
# | id|                text|indicator|
#  --- -------------------- --------- 
# |  1|          [tech, fx]|        1|
# |  2|[industry, computer]|        1|
# |  3|         [5G, Apple]|        0|
#  --- -------------------- ---------

Since your "text" column was in string format, the arr_col expression transforms it to an array first. You need an array, as exists only accepts array type columns.

trim removes the symbols [] from both ends of the string. split splits on the delimiter which is a regex pattern , ?, which means that it splits on every comma , which could (or could not) be followed by one space.

CodePudding user response：

Use array intersect:

import pyspark.sql.functions as f

test_data.withColumn(
    "indicator",
    f.when(
        f.size(f.array_intersect(f.col("text"), f.array(f.lit("fx"), f.lit("computer")))) > 0,1
    ).otherwise(
        f.when(f.col("headline").startswith("Tonight"), 1).otherwise(0)
    )
).show()

 --- -------------------- ------------- --------- 
| id|                text|     headline|indicator|
 --- -------------------- ------------- --------- 
|  1|          [tech, fx]|  Todays Tags|        1|
|  2|[industry, computer]|  Weekly Tags|        1|
|  3|         [5G, Apple]|      General|        0|
|  4|         [5G, Apple]|Tonights News|        1|
 --- -------------------- ------------- ---------

This would also mean you have the text column as ArrayType. OP also wanted to check if headline starts with Tonight.