Compare 2 arrays and only remove the instances of values that exist in both arrays pyspark-CodePudding

I have 2 array columns

| A                      | B                |
| -----------------------|------------------|
| [0.0001,0.0045,0.0031] |[0.0001,0.0031]   |
| [2.5,1.0,0.5,3.0]      | [2.5, 0.5, 3.0]  |
| [1.0,1.0,1.5,1.0]      | [1.0,1.0]        |

I want to create a new column that is an array of only the values that are not included in both arrays, but I want it to pay attention to instances, for example if .0001 exists in both arrays, but it exists once in one array and twice in the other, I want it to leave 1 remaining instance of .0001.

| A                      | B                | C               |
| -----------------------|------------------|-----------------|
| [0.0001,0.0045,0.0031] |[0.0001,0.0031]   |[0.0045]         |    
| [2.5,1.0,0.5,3.0]      | [2.5, 0.5, 3.0]  |[1.0]            |
| [1.0,1.0,1.5,1.0]      | [1.0,1.0]        |[1.0, 1.5]       |

Any ideas how to accomplish this with Pyspark? As of now I have been using df.withColumn("C", F.array_except(F.col("A"), F.col("B"))), but this takes away all instances of the numbers.

CodePudding user response：

Use Counter from collections in an udf to get difference of each value occurrences. using - instead of subtract method on counters will omit values with 0 count.

from pyspark.sql import functions as f
from collections import Counter

l = [([0.0001,0.0045,0.0031],[0.0001,0.0031]),([2.5,1.0,0.5,3.0],[2.5, 0.5, 3.0]),([1.0,1.0,1.5,1.0],[1.0,1.0])]
df = spark.createDataFrame(l,"A:array<double>,B:array<double>"))
df.show()
 ------------------------ ---------------- 
|A                       |B               |
 ------------------------ ---------------- 
|[0.0001, 0.0045, 0.0031]|[0.0001, 0.0031]|
|[2.5, 1.0, 0.5, 3.0]    |[2.5, 0.5, 3.0] |
|[1.0, 1.0, 1.5, 1.0]    |[1.0, 1.0]      |
 ------------------------ ---------------- 


@f.udf('array<double>')
def arr_udf(c1,c2):
    diff = Counter(c1) - Counter(c2)
    return list(diff)

df = df.withColumn('C',arr_udf('A','B'))
df.show()
 ------------------------ ---------------- ---------- 
|A                       |B               |C         |
 ------------------------ ---------------- ---------- 
|[0.0001, 0.0045, 0.0031]|[0.0001, 0.0031]|[0.0045]  |
|[2.5, 1.0, 0.5, 3.0]    |[2.5, 0.5, 3.0] |[1.0]     |
|[1.0, 1.0, 1.5, 1.0]    |[1.0, 1.0]      |[1.0, 1.5]|
 ------------------------ ---------------- ----------