I have a dataframe with the id of some restaurants and it looks like this:
---------------- --------------------------
| shp_id | shop_name |
---------------- --------------------------
| 0001 | BOB'S BURGERS |
| 0002 | DOWNTOWN PIZZA |
| 0003 | CHARLIE MEATS |
| 0004 | AMBER'S INDIAN RESTAURANT|
| 0005 | PARIS FRENCH FOOD |
| 0006 | PIZZA BY ALFREDO |
---------------- --------------------------
I need to label them based on their names. For example: if the restaurant has "pizza" in its name, I'll label its main dish type as pizza.
Basically I need a final dataframe look like this:
---------------- -------------------------- -----------
| shp_id | shop_name | DISH_TYPE |
---------------- -------------------------- -----------
| 0001 | BOB'S BURGERS | BURGUER |
| 0002 | DOWNTOWN PIZZA | PIZZA |
| 0003 | CHARLIE MEATS | BUTCHER |
| 0004 | AMBER'S INDIAN RESTAURANT| INDIAN |
| 0005 | PARIS FRENCH FOOD | FRENCH |
| 0006 | PIZZA BY ALFREDO | PIZZA |
---------------- -------------------------- -----------
I created some lists with some character that would bring me the right dish type if I used the function like()
:
pizza_list = ['pizza', 'pizz', 'pizzeria', 'piz']
burger_list = ['hamb', 'burger', 'burg']
indian_list = ['ind', 'indian', 'hindi']
I tried this, but it didn't work:
df_restaurants2 = (
df_restaurants
.withColumn("name_lowecase", lower(col("rest_name")))
.withColumn("dish_type",
when(col("name_lowecase").isin(col("name_lowecase").like(pizza_list )), "PIZZA")
.when(col("name_lowecase").isin(col("name_lowecase").like(burger_list )), "BURGUER")
.when(col("name_lowecase").isin(col("name_lowecase").like(indian_list )), "INDIAN")
.otherwise("NO LABEL")
)
)
py4j.Py4JException: Method like([class java.util.ArrayList]) does not exist
CodePudding user response:
Use Regex... Should be easier to extract strings with multiple cases. Eg:
regex_extract(lower(col), "(pizz[earia]{0,4})|(ham|burg[er]{0,2})")
Also, a simple way to find patterns would be split() the string of restaurants, and get the distinct count of words inside the array.
CodePudding user response:
I would suggest regex too, but using rlike
and passing a string made from your list by joining all the list items with |
symbol.
df_restaurants2 = (
df_restaurants
.withColumn("name_lowercase", lower(col("rest_name")))
.withColumn("dish_type",
when(col("name_lowercase").rlike('|'.join(pizza_list)), "PIZZA")
.when(col("name_lowercase").rlike('|'.join(burger_list)), "BURGER")
.when(col("name_lowercase").rlike('|'.join(indian_list)), "INDIAN")
.otherwise("NO LABEL")
)
)