I have a Spark dataframe that contains 2 array columns:
+------------------------------------------------------+----------------+
|                                                  var1|            var2|
+------------------------------------------------------+----------------+
|       [black tea, green tea, tea, yerba mate, oolong]|     [green tea]|
|[milk, toned milk, standardised milk, full cream milk]|[cow or buffalo]|
+------------------------------------------------------+----------------+
I need to remove duplicates according to the following rules:
- Check each element of the var1 column against the value of the var2 column, and remove words from var1 that partially (for example, one word: tea) or completely (for example, two words: green tea) match the var2 value.
- If there is a complete match and the element is removed entirely from the var1 array, the extra comma (inside the array or at the end) must also be removed.
- Also remove repeating words from the elements of the var1 column. If one element contains a word that is then repeated in other elements, those duplicates should be removed. For example, given milk, then toned milk, standardised milk, full cream milk, the desired output is: milk, toned, standardised, full cream.
Required output:
+---------------------------------------+----------------+
|                                   var1|            var2|
+---------------------------------------+----------------+
|            [black, yerba mate, oolong]|     [green tea]|
|[milk, toned, standardised, full cream]|[cow or buffalo]|
+---------------------------------------+----------------+
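For reference, a minimal sketch to reproduce the input DataFrame (the schema is assumed to be two array&lt;string&gt; columns):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (["black tea", "green tea", "tea", "yerba mate", "oolong"], ["green tea"]),
        (["milk", "toned milk", "standardised milk", "full cream milk"], ["cow or buffalo"]),
    ],
    ["var1", "var2"],
)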
CodePudding user response:
Definitely a savanna buffalo not cow :-))
from pyspark.sql.functions import array_except, col, regexp_replace, split

df = (
    # Split var1 and var2 into single words and store them in temp columns
    df.select('*', *[split(regexp_replace(col(x).cast('string'), r'\]|\[|\,', ''), r'\s+').alias(f'{x}_1') for x in df.columns])
    # Leverage the rich array functions to remove words that exist in var2 from var1
    .withColumn('var1', array_except('var1', 'var2_1'))
    .withColumn('var1', array_except('var1', 'var2'))
).select('var1', 'var2')

df.show(truncate=False)
+------------------------------------------------------+----------------+
|var1                                                  |var2            |
+------------------------------------------------------+----------------+
|[black, yerba mate, oolong]                           |[green tea]     |
|[milk, toned milk, standardised milk, full cream milk]|[cow or buffalo]|
+------------------------------------------------------+----------------+
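For intuition, array_except compares whole array elements for exact equality only, which is why exploding var2 into single words (var2_1) is needed to catch one-word elements like tea, while multi-word elements such as black tea pass through untouched. A minimal standalone sketch of that behaviour:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_except

spark = SparkSession.builder.getOrCreate()

demo = spark.createDataFrame(
    [(["black tea", "green tea", "tea"], ["green", "tea"])],
    ["var1", "var2_1"],
)
# "tea" equals a whole element of var2_1 and is dropped; "black tea" and
# "green tea" survive because no element of var2_1 equals them exactly
demo.select(array_except("var1", "var2_1").alias("var1")).show(truncate=False)
# +----------------------+
# |var1                  |
# +----------------------+
# |[black tea, green tea]|
# +----------------------+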
CodePudding user response:
Here's one way using array higher-order functions:
- Filter out the elements of array var1 that match elements of the var2 array, using the filter function.
- Join array var1 into a string, remove the duplicated words using a regex, then split again to get an array.
from pyspark.sql import functions as F

df1 = df.withColumn(
    "var1",
    # keep only elements that neither equal nor are contained in a var2 element
    F.filter("var1", lambda x: ~F.exists("var2", lambda y: (x == y) | y.contains(x)))
).withColumn(
    "var1",
    # join the reversed array, then drop any word that occurs again later in the string
    F.regexp_replace(F.array_join(F.reverse("var1"), "#"), r"\b(\w+)\b(?=.*\b\1\b)", "")
).withColumn(
    "var1",
    # split back into an array, restore the original order and trim stray spaces
    F.transform(F.reverse(F.split("var1", "#")), lambda x: F.trim(x))
)
df1.show(truncate=False)
# +---------------------------------------+----------------+
# |var1                                   |var2            |
# +---------------------------------------+----------------+
# |[black tea, yerba mate, oolong]        |[green tea]     |
# |[milk, toned, standardised, full cream]|[cow or buffalo]|
# +---------------------------------------+----------------+
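To see why the reverse/join/split round trip works, here is the same idea in plain Python (the lookahead regex drops a word whenever the same word occurs again later in the string, so reversing first guarantees that the first occurrence in the original order is the one that survives):

import re

words = ["milk", "toned milk", "standardised milk", "full cream milk"]
joined = "#".join(reversed(words))
# 'full cream milk#standardised milk#toned milk#milk'
deduped = re.sub(r"\b(\w+)\b(?=.*\b\1\b)", "", joined)
# 'full cream #standardised #toned #milk'
result = [w.strip() for w in reversed(deduped.split("#"))]
print(result)  # ['milk', 'toned', 'standardised', 'full cream']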