I have a Spark Dataframe that has a column containing strings. These strings are referencing beverages, but can also include amounts / volumes / etc (there is no consistency so a regular expression can help clean this up, but can not resolve this). As a way to circumvent that I was hoping to use a filter to determine if the column string is found in a list and then generate a new column with a boolean, but am not sure the best way to do so.
I tried using case-when
logic but that did not work.
I prefer contains
because it can account for not exact matching which isin
will require.
data = [
[
1,
"SODA",
],
[
2,
"JUICE 1L",
],
[
3,
"WATER 64OZ",
],
[
4,
"HOT TEA",
],
]
df = pd.DataFrame(data, columns=["ID", "Beverage"])
DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]
sdf = spark.createDataFrame(df)
Does anyone know the best way to do this?
CodePudding user response:
Assume this is your drinks array (or list):
val drinks = Array("SODA", "WATER", "COFFEE", "TEA", "JUICE")
We can convert this to a regex expression so we can apply it in rlike
API:
val regex = drinks.map(x => "(" x ")").mkString("|").toLowerCase()
which gives (soda)|(water)|(coffee)|(tea)|(juice)
.
Then, we can apply this to the dataset, like:
df = df.withColumn("is_within", rlike(lower(beverage), regex))
PS: converting the column to lowercase but also your list, just to be sure that everything is compatible.
I would rather use rlike
instead of contains
, since I am not sure if we can come to the same result using the latter without doing individual checks.
Good luck!
CodePudding user response:
%python
import pandas as pd
import pyspark.sql.functions as py
data = [[1, 'SODA'], [2, 'JUICE'], [3, 'WATER'], [4, 'HOT STEA']]
df = pd.DataFrame(data, columns=['ID', 'Beverage'])
sdf = spark.createDataFrame(df)
sdf.show()
DRINK_LIST = ["SODA", "WATER", "COFFEE", "TEA", "JUICE"]
# May need upper or lowercase conversion. Different to scala api.
sdf = sdf.withColumn('check', py.when(py.length(py.regexp_extract('Beverage', '(?=^|\s)(' '|'.join(DRINK_LIST) ')(?=\s|$)', 0)) > 0, True).otherwise(False))
sdf.show()
returns:
--- -------- -----
| ID|Beverage|check|
--- -------- -----
| 1| SODA| true|
| 2| JUICE| true|
| 3| WATER| true|
| 4|HOT STEA|false|
--- -------- -----