Pyspark: regex search with text in a list withColumn

I am new to Spark and I am having a silly "what's-the-best-approach" issue. Basically, I have a map (dict) that I would like to loop over. During each iteration, I want to search through a column in a Spark dataframe using an rlike regex and assign the dict's key to a new column with withColumn:

maps = {"groceries": ["hot chocolate", "milk", "sugar", "flour"],
        "laundry": ["soap", "detergent", "fabric softener"]
}

The data sample is shown below

+--------------------+-----------+
|                  id|item_bought|
+--------------------+-----------+
|uiq7Zq52Bww4pZXc3xri|       Soap|
|fpJatwxTeObcbuJH25UI|  Detergent|
|MdK1q5gBygIGFYyvbz8J|       Milk|
+--------------------+-----------+

I want to get a dataframe that looks like this:

+--------------------+-----------+---------+
|                  id|item_bought|    class|
+--------------------+-----------+---------+
|uiq7Zq52Bww4pZXc3xri|       Soap|  Laundry|
|fpJatwxTeObcbuJH25UI|  Detergent|  Laundry|
|MdK1q5gBygIGFYyvbz8J|       Milk|Groceries|
+--------------------+-----------+---------+

I have over 100M records and I want an approach that uses Spark's best practices (distributed computing). One approach that comes to mind is to loop through the map and use rlike or str.contains for the regex search as shown below:

for key, value in maps.items():
    pattern = '|'.join([f'(?i){x}' for x in value])  # ignore case
    df = df.withColumn("class", col("item_bought").rlike(pattern))

But this returns true or false from the rlike search; I want the matching dict key in the column instead of true or false.
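For clarity, this is roughly the result I'm after (a rough, untested sketch that builds one `when` per key and coalesces them, using the `maps` and `df` defined above):

    from pyspark.sql import functions as F

    # untested sketch: one case-insensitive rlike test per class; first match wins,
    # unmatched rows get null
    class_col = F.coalesce(*[
        F.when(F.col("item_bought").rlike("(?i)" + "|".join(keywords)), F.lit(key))
        for key, keywords in maps.items()
    ])
    df = df.withColumn("class", class_col)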

Also, considering that I have 100M (up to 150M) records, is looping through the map the best approach?

EDIT

What if the item_bought values in the df had special characters (or some extra text)?

+--------------------+----------------+
|                  id|     item_bought|
+--------------------+----------------+
|uiq7Zq52Bww4pZXc3xri|   Soap -&ju10kg|
|fpJatwxTeObcbuJH25UI|Detergent x.ju2i|
|MdK1q5gBygIGFYyvbz8J|            Milk|
+--------------------+----------------+

I don't want to do a text clean-up first; I just want to assign classes based on a regex keyword search.

CodePudding user response:

Given your situation, I would turn the map into a dataframe. I assume the resulting dataframe will be relatively small, so use a broadcast join: it distributes the small df to each worker node, avoiding a shuffle.

from pyspark.sql.functions import broadcast, explode, initcap

# Create a reference df from maps: one row per (class, keyword)
df_ref = (spark.createDataFrame(maps.items(), schema=('class', 'item_bought'))
          .withColumn('item_bought', explode('item_bought'))
          .withColumn('item_bought', initcap('item_bought')))

# Broadcast join on the keyword column
df.join(broadcast(df_ref), how='left', on='item_bought').show()


+-----------+--------------------+---------+
|item_bought|                  id|    class|
+-----------+--------------------+---------+
|       Soap|uiq7Zq52Bww4pZXc3xri|  laundry|
|  Detergent|fpJatwxTeObcbuJH25UI|  laundry|
|       Milk|MdK1q5gBygIGFYyvbz8J|groceries|
+-----------+--------------------+---------+
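If you need the class capitalized as in your expected output ('Laundry', 'Groceries'), one option is to initcap that column as well:

    # optional: capitalize the class values to match the expected output
    df.join(broadcast(df_ref), how='left', on='item_bought') \
      .withColumn('class', initcap('class')).show()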

Following your edit:

from pyspark.sql.functions import regexp_extract

df_ref = (spark.createDataFrame(maps.items(), schema=('class', 'item_bought1'))
          .withColumn('item_bought1', explode('item_bought1'))
          .withColumn('item_bought1', initcap('item_bought1')))

# Extract the leading word from item_bought, then broadcast-join on it
df.withColumn('item_bought1', regexp_extract('item_bought', '^[A-Za-z]+', 0)) \
  .join(broadcast(df_ref), how='left', on='item_bought1').show()

+------------+--------------------+----------------+---------+
|item_bought1|                  id|     item_bought|    class|
+------------+--------------------+----------------+---------+
|        Soap|uiq7Zq52Bww4pZXc3xri|            Soap|  laundry|
|   Detergent|fpJatwxTeObcbuJH25UI|       Detergent|  laundry|
|        Milk|MdK1q5gBygIGFYyvbz8J|            Milk|groceries|
|        Soap|uiq7Zq52Bww4pZXc3xri|   Soap -&ju10kg|  laundry|
|   Detergent|fpJatwxTeObcbuJH25UI|Detergent x.ju2i|  laundry|
|        Milk|MdK1q5gBygIGFYyvbz8J|            Milk|groceries|
+------------+--------------------+----------------+---------+
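Note that extracting only the leading word won't match multi-word keywords like "hot chocolate" or "fabric softener". If that matters, a rough sketch (untested) would be to join on an rlike condition instead of an equality:

    from pyspark.sql.functions import broadcast, explode, expr

    # untested sketch: case-insensitive rlike join condition, so multi-word
    # keywords can also match inside item_bought (assumes keywords contain
    # no regex metacharacters)
    df_kw = (spark.createDataFrame(maps.items(), schema=('class', 'keyword'))
             .withColumn('keyword', explode('keyword')))

    df.join(broadcast(df_kw),
            expr("item_bought rlike concat('(?i)', keyword)"),
            how='left').show()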
