I have a data frame looking like this
code 1 code 2 Fruit_Group temp_code
(string) (string) (string) (boolean)
12E5-11 12E5-11 Apple True
12E5-11 ERE5-11,12E5-11 Apple True
12E5-11 MMMM-11 Apple True #color to be changed
12E5-11 XXXX-11 Apple False #color to be changed
12E5-11 12E5-11 Apple True
12E5-11 12E5-11, ERE5-11 Apple True
By default the Fruit_Group
is assigned to some fruits:.
The color remains Apple when the first four digits of "code 1" match with first four digits of "code 2" (also consider ',' - the four digits should be matched with any one of "code 2")
I have to change the Fruit group for rows 3 and 4 since no match is found:
- If digits do not match then check
temp_code
: iftemp_code == True
--> Banana else Orange
Can anyone help me with this?
Data:
x = [("12E5-11", "12E5-11", "Apple", True), ("12E5-11", "ERE5-11,12E5-11", "Apple", True), ("12E5-11", "MMMM-11", "Apple", True ), ("12E5-11", "XXXX-11" ,"Apple", False), ("12E5-11", "12E5-11", "Apple", True), ("12E5-11", "ERE5-11,12E5-11", "Apple", True)]
Fruits_df = spark.createDataFrame(x, schema=["code 1", "code 2","Fruit_Group","temp_code"])
CodePudding user response:
You can use rlike
to check if code 2
contains a code starting with the first 4 characters of code 1
. If those characters are 1234, the first subtlety is to avoid matching with something that contains 1234 but does not start with 1234. For that, we can add (^|,)
at the beginning of the regex. The second subtlety is that rlike
in spark (python and scala) can only be used with a fixed string and not with a column, expect when used inside expr
:-) Once you have checked whether or not code 2
matches your criteria, you can use when
to change your fruit accordingly:
Fruits_df\
.withColumn("pattern", F.concat(F.lit("(^|,)"), F.substring("code 1", 0, 4)))\
.withColumn("Fruit_Group",
F.when(F.expr("`code 2` rlike `pattern`"), F.col("Fruit_Group"))
.when(F.col("temp_code"), F.lit("Banana"))
.otherwise(F.lit("Orange"))
).show()
------- --------------- ----------- --------- ---------
| code 1| code 2|Fruit_Group|temp_code| pattern|
------- --------------- ----------- --------- ---------
|12E5-11| 12E5-11| Apple| true|(^|,)12E5|
|12E5-11|ERE5-11,12E5-11| Apple| true|(^|,)12E5|
|12E5-11| MMMM-11| Banana| true|(^|,)12E5|
|12E5-11| XXXX-11| Orange| false|(^|,)12E5|
|12E5-11| 12E5-11| Apple| true|(^|,)12E5|
|12E5-11|ERE5-11,12E5-11| Apple| true|(^|,)12E5|
------- --------------- ----------- --------- ---------