Home > Back-end >  PySpark - Check if column of strings contain words in a list of string and extract them
PySpark - Check if column of strings contain words in a list of string and extract them

Time:10-15

I would like to check if items in my lists are in the strings in my column, and know which of them.

Let say I have a PySpark Dataframe containing id and description with 25M rows like this: enter image description here

And I have a list of strings like this :

technos = ["SQL", "NodeJS", "R", "C  ", "Google Cloud"...]

I would like to check, for each description in my dataframe, which items from the technos list are found.

In output, I would like something like: enter image description here

Everything I tried until now has failed.

I tried using UDF and a python function with a for loop, but since it doesn’t leverage Spark distributed computing it can't scale with big amount of row.

I recently tried using pyspark.sql.Column.contains followed by pyspark.sql.DataFrame.filter but the filter step is taking so long just for one "techno" and 1M description.

Is there a way to optimize the filter ? I saw something with .join but I don't understand how it can work in my case.

Is it even possible to optimize such a processing with Spark ?

CodePudding user response:

This solution is done in Scala, but the same logic can be applied on Python as well (very simple syntax);

First of all, concatenate your original list: [SQL, NodeJS, R, C\\\\ \\\\ ] to something like this: (SQL)|(NodeJS)|(R)|(C\\\\ \\\\ ).

In Scala, I use: Array.mkString("\"(", ")|(", ")\"") and I store this in a variable, say expr.

Then, assume the dataset is called df and contains:

 --- ---------------------------------------- 
|id |desc                                    |
 --- ---------------------------------------- 
|1  |Being SQL master and knowing basics of R|
|2  |Mastering Python and C                  |
|3  |Nothing                                 |
 --- ---------------------------------------- 

You can use regexp_extract_all as below:

df.withColumn("technos_found", expr(s"regexp_extract_all(desc, $expr, 0)"))

And gives this:

 --- ---------------------------------------- ------------- 
|id |desc                                    |technos_found|
 --- ---------------------------------------- ------------- 
|1  |Being SQL master and knowing basics of R|[SQL, R]     |
|2  |Mastering Python and C                  |[C  ]        |
|3  |Nothing                                 |[]           |
 --- ---------------------------------------- ------------- 

NOTE: The \\\\ in symbol is very important, as is used for other purposes in regex.

I hope this is what you need, good luck!

CodePudding user response:

from pyspark.sql.types import StringType
from pyspark.sql import functions as F
technos = ["SQL", "NodeJS", "R", "C  ", "Python"]
phrases = ['mastering C   Python', 'Being SQL master', "working with R"]
df = spark.createDataFrame(technos, StringType())
df1 = spark.createDataFrame(phrases, StringType())

df_exploded = df1.withColumn("items", F.explode(F.split(F.col("value"), ' ')))
df_exploded.show()

 -------------------- --------- 
|               value|    items|
 -------------------- --------- 
|mastering C   Python|mastering|
|mastering C   Python|      C  |
|mastering C   Python|   Python|
|    Being SQL master|    Being|
|    Being SQL master|      SQL|
|    Being SQL master|   master|
|      working with R|  working|
|      working with R|     with|
|      working with R|        R|
 -------------------- --------- 


df_exploded.join(F.broadcast(df), df.value == df_exploded.items).groupBy(df_exploded.value).agg(F.collect_list(df.value)).show()

 -------------------- ------------------- 
|               value|collect_list(value)|
 -------------------- ------------------- 
|      working with R|                [R]|
|mastering C   Python|      [C  , Python]|
|    Being SQL master|              [SQL]|
 -------------------- ------------------- 

Here is an example, the key to optimize your performance is to broadcast your technos which I hope won't be very large and you use join to collect all the keys present in your description

  • Related