I have 2 array columns
| A | B |
| -----------------------|------------------|
| [0.0001,0.0045,0.0031] |[0.0001,0.0031] |
| [2.5,1.0,0.5,3.0] | [2.5, 0.5, 3.0] |
| [1.0,1.0,1.5,1.0] | [1.0,1.0] |
I want to create a new column that is an array of only the values that are not included in both arrays, but I want it to pay attention to instances, for example if .0001 exists in both arrays, but it exists once in one array and twice in the other, I want it to leave 1 remaining instance of .0001.
| A | B | C |
| -----------------------|------------------|-----------------|
| [0.0001,0.0045,0.0031] |[0.0001,0.0031] |[0.0045] |
| [2.5,1.0,0.5,3.0] | [2.5, 0.5, 3.0] |[1.0] |
| [1.0,1.0,1.5,1.0] | [1.0,1.0] |[1.0, 1.5] |
Any ideas how to accomplish this with Pyspark? As of now I have been using df.withColumn("C", F.array_except(F.col("A"), F.col("B"))), but this takes away all instances of the numbers.
CodePudding user response:
Use Counter
from collections
in an udf
to get difference of each value occurrences. using -
instead of subtract
method on counters will omit values with 0 count.
from pyspark.sql import functions as f
from collections import Counter
l = [([0.0001,0.0045,0.0031],[0.0001,0.0031]),([2.5,1.0,0.5,3.0],[2.5, 0.5, 3.0]),([1.0,1.0,1.5,1.0],[1.0,1.0])]
df = spark.createDataFrame(l,"A:array<double>,B:array<double>"))
df.show()
------------------------ ----------------
|A |B |
------------------------ ----------------
|[0.0001, 0.0045, 0.0031]|[0.0001, 0.0031]|
|[2.5, 1.0, 0.5, 3.0] |[2.5, 0.5, 3.0] |
|[1.0, 1.0, 1.5, 1.0] |[1.0, 1.0] |
------------------------ ----------------
@f.udf('array<double>')
def arr_udf(c1,c2):
diff = Counter(c1) - Counter(c2)
return list(diff)
df = df.withColumn('C',arr_udf('A','B'))
df.show()
------------------------ ---------------- ----------
|A |B |C |
------------------------ ---------------- ----------
|[0.0001, 0.0045, 0.0031]|[0.0001, 0.0031]|[0.0045] |
|[2.5, 1.0, 0.5, 3.0] |[2.5, 0.5, 3.0] |[1.0] |
|[1.0, 1.0, 1.5, 1.0] |[1.0, 1.0] |[1.0, 1.5]|
------------------------ ---------------- ----------