Some mock data:
test_data = [('1', '[tech, fx]'),
('2', '[industry, computer]'),
('3', '[5G, Apple]')]
test_data = spark.sparkContext.parallelize(test_data).toDF(['id', 'text'])
I wan't to create a new column that indicates if the words ['fx', 'computer'] can be found in the column called 'text':
Desired output:
result = [('1', '[tech, fx]', '1'),
('2', '[industry, computer]', '1'),
('3', '[5G, Apple]', '0')]
result = spark.sparkContext.parallelize(result).toDF(['id', 'text', 'indicator'])
CodePudding user response:
You can do it using higher-order function exists
from pyspark.sql import functions as F
arr_col = F.split(F.expr("TRIM(BOTH '[]' FROM text)"), ", ?")
bln_col = F.exists(arr_col, lambda x: x.isin(['fx', 'computer']))
result = test_data.withColumn('indicator', bln_col.cast('int'))
result.show()
# --- -------------------- ---------
# | id| text|indicator|
# --- -------------------- ---------
# | 1| [tech, fx]| 1|
# | 2|[industry, computer]| 1|
# | 3| [5G, Apple]| 0|
# --- -------------------- ---------
Since your "text" column was in string format, the arr_col
expression transforms it to an array first. You need an array, as exists
only accepts array type columns.
trim
removes the symbols []
from both ends of the string. split
splits on the delimiter which is a regex pattern , ?
, which means that it splits on every comma ,
which could (or could not) be followed by one space.
CodePudding user response:
Use array intersect
:
import pyspark.sql.functions as f
test_data.withColumn(
"indicator",
f.when(
f.size(f.array_intersect(f.col("text"), f.array(f.lit("fx"), f.lit("computer")))) > 0,1
).otherwise(
f.when(f.col("headline").startswith("Tonight"), 1).otherwise(0)
)
).show()
--- -------------------- ------------- ---------
| id| text| headline|indicator|
--- -------------------- ------------- ---------
| 1| [tech, fx]| Todays Tags| 1|
| 2|[industry, computer]| Weekly Tags| 1|
| 3| [5G, Apple]| General| 0|
| 4| [5G, Apple]|Tonights News| 1|
--- -------------------- ------------- ---------
This would also mean you have the text
column as ArrayType
.
OP also wanted to check if headline
starts with Tonight
.