Check if a word is a part of another word-CodePudding

I have two columns with some text:

text_1	text_2
astro lumen cosm planet	microcosm astronomy planet magnitude

I need to remove a word from column text_1 if this word occurs in text_2 column (i.e., is a complete duplicate) or is a part of some word in text_2 column.

Desired output:

text_1	text_2
lumen	microcosm astronomy planet magnitude

How can I do this in PostgreSQL and/or PySpark?

CodePudding user response：

You can split the first column into array of words then filter the array using filter function like this:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("astro lumen cosm planet", "microcosm astronomy planet magnitude")],
    ["text_1", "text_2"]
)

df1 = df.withColumn(
    "text_1",
    F.array_join(
        F.filter(F.split("text_1", "\\s "), lambda x: ~F.col("text_2").contains(x)),
        " "
    )
)

df1.show(truncate=False)
# ------ ------------------------------------ 
#|text_1|text_2                              |
# ------ ------------------------------------ 
#|lumen |microcosm astronomy planet magnitude|
# ------ ------------------------------------

Note that for spark before 3.1 , you need to use expr for higher order function filter

CodePudding user response：

Here is a way to do it in SQL:

WITH data AS (
   SELECT 'astro lumen cosm planet' AS needles,
          'microcosm astronomy planet magnitude' AS haystack
)
SELECT string_agg(needle.n, ' ')
FROM data
   CROSS JOIN LATERAL regexp_split_to_table(data.needles, '  ') AS needle(n)
WHERE strpos(data.haystack, needle.n) = 0;

 string_agg 
════════════
 lumen
(1 row)