I have two columns with some text:
text_1 | text_2 |
---|---|
astro lumen cosm planet | microcosm astronomy planet magnitude |
I need to remove a word from column text_1
if this word occurs in text_2
column (i.e., is a complete duplicate) or is a part of some word in text_2
column.
Desired output:
text_1 | text_2 |
---|---|
lumen | microcosm astronomy planet magnitude |
How can I do this in PostgreSQL and/or PySpark?
CodePudding user response:
You can split the first column into array of words then filter the array using filter
function like this:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[("astro lumen cosm planet", "microcosm astronomy planet magnitude")],
["text_1", "text_2"]
)
df1 = df.withColumn(
"text_1",
F.array_join(
F.filter(F.split("text_1", "\\s "), lambda x: ~F.col("text_2").contains(x)),
" "
)
)
df1.show(truncate=False)
# ------ ------------------------------------
#|text_1|text_2 |
# ------ ------------------------------------
#|lumen |microcosm astronomy planet magnitude|
# ------ ------------------------------------
Note that for spark before 3.1 , you need to use expr
for higher order function filter
CodePudding user response:
Here is a way to do it in SQL:
WITH data AS (
SELECT 'astro lumen cosm planet' AS needles,
'microcosm astronomy planet magnitude' AS haystack
)
SELECT string_agg(needle.n, ' ')
FROM data
CROSS JOIN LATERAL regexp_split_to_table(data.needles, ' ') AS needle(n)
WHERE strpos(data.haystack, needle.n) = 0;
string_agg
════════════
lumen
(1 row)