Home > OS >  manipulate one column based on the first 4 digits of another column PySpark
manipulate one column based on the first 4 digits of another column PySpark

Time:01-05

I have a data frame looking like this

code 1      code 2               Fruit_Group           temp_code
(string)   (string)              (string)             (boolean)
 
12E5-11     12E5-11                Apple              True 
12E5-11     ERE5-11,12E5-11        Apple              True
12E5-11     MMMM-11                Apple              True       #color to be changed 
12E5-11     XXXX-11                Apple              False       #color to be changed
12E5-11     12E5-11                Apple              True 
12E5-11     12E5-11, ERE5-11       Apple              True

By default the Fruit_Group is assigned to some fruits:.

The color remains Apple when the first four digits of "code 1" match with first four digits of "code 2" (also consider ',' - the four digits should be matched with any one of "code 2")

I have to change the Fruit group for rows 3 and 4 since no match is found:

  • If digits do not match then check temp_code: if temp_code == True --> Banana else Orange

Can anyone help me with this?

Data:

x = [("12E5-11", "12E5-11", "Apple", True), ("12E5-11", "ERE5-11,12E5-11", "Apple", True), ("12E5-11", "MMMM-11", "Apple", True ), ("12E5-11", "XXXX-11" ,"Apple", False), ("12E5-11", "12E5-11", "Apple", True), ("12E5-11", "ERE5-11,12E5-11", "Apple", True)]

Fruits_df = spark.createDataFrame(x, schema=["code 1", "code 2","Fruit_Group","temp_code"])

CodePudding user response:

You can use rlike to check if code 2 contains a code starting with the first 4 characters of code 1. If those characters are 1234, the first subtlety is to avoid matching with something that contains 1234 but does not start with 1234. For that, we can add (^|,) at the beginning of the regex. The second subtlety is that rlike in spark (python and scala) can only be used with a fixed string and not with a column, expect when used inside expr :-) Once you have checked whether or not code 2 matches your criteria, you can use when to change your fruit accordingly:

Fruits_df\
    .withColumn("pattern", F.concat(F.lit("(^|,)"), F.substring("code 1", 0, 4)))\
    .withColumn("Fruit_Group", 
        F.when(F.expr("`code 2` rlike `pattern`"), F.col("Fruit_Group"))
         .when(F.col("temp_code"), F.lit("Banana"))
         .otherwise(F.lit("Orange"))
    ).show()
 ------- --------------- ----------- --------- --------- 
| code 1|         code 2|Fruit_Group|temp_code|  pattern|
 ------- --------------- ----------- --------- --------- 
|12E5-11|        12E5-11|      Apple|     true|(^|,)12E5|
|12E5-11|ERE5-11,12E5-11|      Apple|     true|(^|,)12E5|
|12E5-11|        MMMM-11|     Banana|     true|(^|,)12E5|
|12E5-11|        XXXX-11|     Orange|    false|(^|,)12E5|
|12E5-11|        12E5-11|      Apple|     true|(^|,)12E5|
|12E5-11|ERE5-11,12E5-11|      Apple|     true|(^|,)12E5|
 ------- --------------- ----------- --------- --------- 
  • Related