I have a Spark dataframe that contains 2 array columns:
+------------------------------------------------------+----------------+
|                                                  var1|            var2|
+------------------------------------------------------+----------------+
|       [black tea, green tea, tea, yerba mate, oolong]|     [green tea]|
|[milk, toned milk, standardised milk, full cream milk]|[cow or buffalo]|
+------------------------------------------------------+----------------+
I need to remove duplicates according to the following rules:
- Check each element of the var1 column against the value of the var2 column, and remove words from var1 that partially (for example, one word: tea) or completely (for example, two words: green tea) match the var2 value.
- If there is a complete match and the element is removed entirely from the var1 array, the extra comma (inside the array or at the end) must also be removed.
- Also remove repeating words from the elements of the var1 column. If one element contains a word that is then repeated in other elements, those duplicates should be removed. For example, given milk, then toned milk, standardised milk, full cream milk, the desired output is: milk, toned, standardised, full cream.
Required output:
+---------------------------------------+----------------+
|                                   var1|            var2|
+---------------------------------------+----------------+
|            [black, yerba mate, oolong]|     [green tea]|
|[milk, toned, standardised, full cream]|[cow or buffalo]|
+---------------------------------------+----------------+
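For reference, a minimal sketch to reproduce the input DataFrame (the schema is assumed to be two array&lt;string&gt; columns):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (["black tea", "green tea", "tea", "yerba mate", "oolong"], ["green tea"]),
        (["milk", "toned milk", "standardised milk", "full cream milk"], ["cow or buffalo"]),
    ],
    ["var1", "var2"],
)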
CodePudding user response:
Definitely a savanna buffalo not cow :-))
from pyspark.sql.functions import array_except, col, regexp_replace, split

df = (
    # Split var1 and var2 into single words and store them in temp columns
    df.select('*', *[split(regexp_replace(col(x).cast('string'), r'\]|\[|\,', ''), r'\s+').alias(f'{x}_1') for x in df.columns])
    # Leverage the rich array functions to remove words that exist in var2 from var1
    .withColumn('var1', array_except('var1', 'var2_1'))
    .withColumn('var1', array_except('var1', 'var2'))
).select('var1', 'var2')

df.show(truncate=False)
+------------------------------------------------------+----------------+
|var1                                                  |var2            |
+------------------------------------------------------+----------------+
|[black, yerba mate, oolong]                           |[green tea]     |
|[milk, toned milk, standardised milk, full cream milk]|[cow or buffalo]|
+------------------------------------------------------+----------------+
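For intuition, array_except compares whole array elements for exact equality only, which is why exploding var2 into single words (var2_1) is needed to catch one-word elements like tea, while multi-word elements such as black tea pass through untouched. A minimal standalone sketch of that behaviour:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_except

spark = SparkSession.builder.getOrCreate()

demo = spark.createDataFrame(
    [(["black tea", "green tea", "tea"], ["green", "tea"])],
    ["var1", "var2_1"],
)
# "tea" equals a whole element of var2_1 and is dropped; "black tea" and
# "green tea" survive because no element of var2_1 equals them exactly
demo.select(array_except("var1", "var2_1").alias("var1")).show(truncate=False)
# +----------------------+
# |var1                  |
# +----------------------+
# |[black tea, green tea]|
# +----------------------+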
CodePudding user response:
Here's one way using array higher-order functions:
- Filter out the elements of array var1 that match elements of the var2 array, using the filter function.
- Join array var1 into a string, remove the duplicated words using a regex, then split again to get an array.
from pyspark.sql import functions as F

df1 = df.withColumn(
    "var1",
    # keep only elements that neither equal nor are contained in a var2 element
    F.filter("var1", lambda x: ~F.exists("var2", lambda y: (x == y) | y.contains(x)))
).withColumn(
    "var1",
    # join the reversed array, then drop any word that occurs again later in the string
    F.regexp_replace(F.array_join(F.reverse("var1"), "#"), r"\b(\w+)\b(?=.*\b\1\b)", "")
).withColumn(
    "var1",
    # split back into an array, restore the original order and trim stray spaces
    F.transform(F.reverse(F.split("var1", "#")), lambda x: F.trim(x))
)
df1.show(truncate=False)
# +---------------------------------------+----------------+
# |var1                                   |var2            |
# +---------------------------------------+----------------+
# |[black tea, yerba mate, oolong]        |[green tea]     |
# |[milk, toned, standardised, full cream]|[cow or buffalo]|
# +---------------------------------------+----------------+
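To see why the reverse/join/split round trip works, here is the same idea in plain Python (the lookahead regex drops a word whenever the same word occurs again later in the string, so reversing first guarantees that the first occurrence in the original order is the one that survives):

import re

words = ["milk", "toned milk", "standardised milk", "full cream milk"]
joined = "#".join(reversed(words))
# 'full cream milk#standardised milk#toned milk#milk'
deduped = re.sub(r"\b(\w+)\b(?=.*\b\1\b)", "", joined)
# 'full cream #standardised #toned #milk'
result = [w.strip() for w in reversed(deduped.split("#"))]
print(result)  # ['milk', 'toned', 'standardised', 'full cream']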