I have the below list.
lst=['name','age','country']
I have the below pyspark
dataframe
column_a column_b
Aaaa name,age,subject
Bbbb name,age,country,subject
Cccc name,subject,percentage
I have to compare the list with column_b
and check if the values in the list are part of the column and create a new column and populate it with the values from the list that are available in column_b
.
Below is the expected output.
column_a column_b column_c
Aaaa name,age,subject name,age
Bbbb name,age,country,subject name,age,country
Cccc name,subject,percentage name
CodePudding user response:
Without Duplicates
array_intersect
allows for the operation you want to achieve.
array_intersect
does not allow for duplicates, (i.e, ) ifcolumn_b
had a value of["name", "name"]
thencolumn_c
would contain["name"]
once.
from pyspark.sql import functions as F
data = [("Aaaa", ["name", "age", "subject"],),
("Bbbb", ["name", "age", "country", "subject"],),
("Cccc", ["name", "subject", "percentage"],),
("Dddd", ["name", "name"],),]
df = spark.createDataFrame(data, ("column_a", "column_b",))
lst=['name','age','country']
lit_lst = [F.lit(v) for v in lst]
df.withColumn("column_c", F.array_intersect(F.col("column_b"), F.array(lit_lst))).show(truncate=False)
Output
-------- ----------------------------- --------------------
|column_a|column_b |column_c |
-------- ----------------------------- --------------------
|Aaaa |[name, age, subject] |[name, age] |
|Bbbb |[name, age, country, subject]|[name, age, country]|
|Cccc |[name, subject, percentage] |[name] |
|Dddd |[name, name] |[name] |
-------- ----------------------------- --------------------
Preserve Duplicates
To preserve duplicates, filter
Higher Order Function can be applied.
from pyspark.sql import functions as F
data = [("Aaaa", ["name", "age", "subject"],),
("Bbbb", ["name", "age", "country", "subject"],),
("Cccc", ["name", "subject", "percentage"],),
("Dddd", ["name", "name"],),]
df = spark.createDataFrame(data, ("column_a", "column_b",))
df.withColumn("column_c", F.array(lit_lst))\
.withColumn("column_c", F.expr("filter(column_b, element -> array_contains(column_c, element))"))\
.show(truncate=False)
Output
-------- ----------------------------- --------------------
|column_a|column_b |column_c |
-------- ----------------------------- --------------------
|Aaaa |[name, age, subject] |[name, age] |
|Bbbb |[name, age, country, subject]|[name, age, country]|
|Cccc |[name, subject, percentage] |[name] |
|Dddd |[name, name] |[name, name] |
-------- ----------------------------- --------------------