Create a new column in a PySpark dataframe when substring is in a list-CodePudding

I have a dataframe with the id of some restaurants and it looks like this:

 ---------------- -------------------------- 
|  shp_id        |  shop_name               |
 ---------------- -------------------------- 
|  0001          | BOB'S BURGERS            |
|  0002          | DOWNTOWN PIZZA           |
|  0003          | CHARLIE MEATS            |
|  0004          | AMBER'S INDIAN RESTAURANT|
|  0005          | PARIS FRENCH FOOD        |
|  0006          | PIZZA BY ALFREDO         |
 ---------------- --------------------------

I need to label them based on their names. For example: if the restaurant has "pizza" in its name, I'll label its main dish type as pizza.

Basically I need a final dataframe look like this:

 ---------------- -------------------------- ----------- 
|  shp_id        |  shop_name               | DISH_TYPE |
 ---------------- -------------------------- ----------- 
|  0001          | BOB'S BURGERS            | BURGUER   |
|  0002          | DOWNTOWN PIZZA           | PIZZA     |
|  0003          | CHARLIE MEATS            | BUTCHER   |
|  0004          | AMBER'S INDIAN RESTAURANT| INDIAN    |
|  0005          | PARIS FRENCH FOOD        | FRENCH    |
|  0006          | PIZZA BY ALFREDO         | PIZZA     |
 ---------------- -------------------------- -----------

I created some lists with some character that would bring me the right dish type if I used the function like():

pizza_list = ['pizza', 'pizz', 'pizzeria', 'piz']
burger_list = ['hamb', 'burger', 'burg']
indian_list = ['ind', 'indian', 'hindi']

I tried this, but it didn't work:

df_restaurants2 = (
  df_restaurants
  .withColumn("name_lowecase", lower(col("rest_name")))
  .withColumn("dish_type",
              when(col("name_lowecase").isin(col("name_lowecase").like(pizza_list )), "PIZZA")
              .when(col("name_lowecase").isin(col("name_lowecase").like(burger_list )), "BURGUER")
              .when(col("name_lowecase").isin(col("name_lowecase").like(indian_list )), "INDIAN")
              .otherwise("NO LABEL")
             )
)

py4j.Py4JException: Method like([class java.util.ArrayList]) does not exist

CodePudding user response：

Use Regex... Should be easier to extract strings with multiple cases. Eg:

regex_extract(lower(col), "(pizz[earia]{0,4})|(ham|burg[er]{0,2})")

Also, a simple way to find patterns would be split() the string of restaurants, and get the distinct count of words inside the array.

CodePudding user response：

I would suggest regex too, but using rlike and passing a string made from your list by joining all the list items with | symbol.

df_restaurants2 = (
  df_restaurants
  .withColumn("name_lowercase", lower(col("rest_name")))
  .withColumn("dish_type",
              when(col("name_lowercase").rlike('|'.join(pizza_list)), "PIZZA")
              .when(col("name_lowercase").rlike('|'.join(burger_list)), "BURGER")
              .when(col("name_lowercase").rlike('|'.join(indian_list)), "INDIAN")
              .otherwise("NO LABEL")
             )
)