How to create Boolean if a Pyspark column string is found in a list of strings?-CodePudding

I have a Spark Dataframe that has a column containing strings. These strings are referencing beverages, but can also include amounts / volumes / etc (there is no consistency so a regular expression can help clean this up, but can not resolve this). As a way to circumvent that I was hoping to use a filter to determine if the column string is found in a list and then generate a new column with a boolean, but am not sure the best way to do so.

I tried using case-when logic but that did not work.

I prefer contains because it can account for not exact matching which isin will require.

data = [
    [
        1,
        "SODA",
    ],
    [
        2,
        "JUICE 1L",
    ],
    [
        3,
        "WATER 64OZ",
    ],
    [
        4,
        "HOT TEA",
    ],
]

df = pd.DataFrame(data, columns=["ID", "Beverage"])

DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]
sdf = spark.createDataFrame(df)

Does anyone know the best way to do this?

CodePudding user response：

Assume this is your drinks array (or list):

val drinks = Array("SODA", "WATER", "COFFEE", "TEA", "JUICE")

We can convert this to a regex expression so we can apply it in rlike API:

val regex = drinks.map(x => "("   x   ")").mkString("|").toLowerCase()

which gives (soda)|(water)|(coffee)|(tea)|(juice).

Then, we can apply this to the dataset, like:

df = df.withColumn("is_within", rlike(lower(beverage), regex))

PS: converting the column to lowercase but also your list, just to be sure that everything is compatible.

I would rather use rlike instead of contains, since I am not sure if we can come to the same result using the latter without doing individual checks.

Good luck!

CodePudding user response：

%python

import pandas as pd
import pyspark.sql.functions as py 

data = [[1, 'SODA'], [2, 'JUICE'], [3, 'WATER'], [4, 'HOT STEA']]

df = pd.DataFrame(data, columns=['ID', 'Beverage'])
sdf = spark.createDataFrame(df)
sdf.show()

DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]

# May need upper or lowercase conversion. Different to scala api.
sdf = sdf.withColumn('check', py.when(py.length(py.regexp_extract('Beverage', '(?=^|\s)('   '|'.join(DRINK_LIST)   ')(?=\s|$)', 0)) > 0, True).otherwise(False))    
sdf.show()

returns:

 --- -------- ----- 
| ID|Beverage|check|
 --- -------- ----- 
|  1|    SODA| true|
|  2|   JUICE| true|
|  3|   WATER| true|
|  4|HOT STEA|false|
 --- -------- -----