I have the below PySpark dataframe:
a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']
df = pd.DataFrame(dict(dcode=a, zone=b))
dcode zone
0 480s North
1 480s West
2 499s East
3 499s North
4 650s East
5 650s North
6 702s North
7 702s West
8 736s North
9 736s South
10 736s West
11 737s North
12 737s West
I want my data frame to look like this -
dcode zone output
0 480s North NW
1 480s West NW
2 499s East
3 499s North NW
4 650s East
5 650s North NW
6 702s North
7 702s West
8 736s North
9 736s South
10 736s West
11 737s North
12 737s West
To achieve this I am using the logic below, but it's not giving the desired results.
df_ = df.withColumn("output", F.when((F.col("Zone") == "North") | (F.col("Zone") == "West") & (F.col("dcode") != "702s") | (F.col("dcode") != "736s") | (F.col("dcode") != "737s"), "NW"))
I want NW in the output column only when zone is North or West and dcode is not in 702s, 736s, 737s.
CodePudding user response:
You can directly use SQL-style expressions (the expr function).
import pyspark.sql.functions as F
......
df = df.withColumn('output', F.expr("case when zone in ('North', 'West') and dcode not in ('736s', '737s', '702s') then 'NW' end"))
......
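For reference, the same CASE WHEN rule can be sanity-checked on the original pandas frame before converting to Spark (a minimal sketch using numpy.where, not part of the Spark answer; column names follow the question):

```python
import numpy as np
import pandas as pd

# Data from the question
a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']
pdf = pd.DataFrame(dict(dcode=a, zone=b))

# zone in (North, West) AND dcode not in (736s, 737s, 702s)
mask = pdf['zone'].isin(['North', 'West']) & ~pdf['dcode'].isin(['736s', '737s', '702s'])
pdf['output'] = np.where(mask, 'NW', None)  # None plays the role of Spark's null
```

Only the 480s, 499s North, and 650s North rows end up with NW; every 702s/736s/737s row stays null, matching the desired output.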
CodePudding user response:
Check your parentheses - and note the dcode conditions must be joined with & rather than |, since a value can never be unequal to all three at once. By the way, df = pd.DataFrame(dict(dcode=a, zone=b)) is not PySpark.
from pyspark.sql import functions as F
import pandas as pd
a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']
df = pd.DataFrame(dict(dcode=a, zone=b))
df_ = spark.createDataFrame(df)
df_ = df_.withColumn(
    "output",
    F.when(
        ((F.col("zone") == "North") | (F.col("zone") == "West"))
        & (F.col("dcode") != "702s")
        & (F.col("dcode") != "736s")
        & (F.col("dcode") != "737s"),
        "NW",
    ),
)
df_.show()
+-----+-----+------+
|dcode| zone|output|
+-----+-----+------+
| 480s|North|    NW|
| 480s| West|    NW|
| 499s| East|  null|
| 499s|North|    NW|
| 650s| East|  null|
| 650s|North|    NW|
| 702s|North|  null|
| 702s| West|  null|
| 736s|North|  null|
| 736s|South|  null|
| 736s| West|  null|
| 737s|North|  null|
| 737s| West|  null|
+-----+-----+------+
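The two pitfalls in the original expression can be seen with plain Python booleans (a quick sketch, nothing Spark-specific):

```python
# Pitfall 1: & binds tighter than |, so a | b & c parses as a | (b & c).
assert (True | False & False) is True      # parsed as True | (False & False)
assert ((True | False) & False) is False   # explicit parentheses change the result

# Pitfall 2: OR-ing inequalities against different values is always True.
dcode = '702s'
assert ((dcode != '702s') | (dcode != '736s')) is True   # every dcode passes this
# AND is what the filter actually needs:
assert ((dcode != '702s') & (dcode != '736s')) is False  # 702s is now excluded
```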
CodePudding user response:
Please consider first converting your pandas df to a Spark one, since you are using PySpark syntax. Then I would advise rewriting your code in a more concise and clearer way, using isin:
from pyspark.sql import functions as F
df = spark.createDataFrame(df)
df_ = df.withColumn(
    "output",
    F.when(
        F.col("zone").isin("North", "West")
        & ~F.col("dcode").isin("736s", "737s", "702s"),
        "NW",
    ).otherwise(""),
)
>>> df_.show(truncate=False)
+-----+-----+------+
|dcode|zone |output|
+-----+-----+------+
|480s |North|NW |
|480s |West |NW |
|499s |East | |
|499s |North|NW |
|650s |East | |
|650s |North|NW |
|702s |North| |
|702s |West | |
|736s |North| |
|736s |South| |
|736s |West | |
|737s |North| |
|737s |West | |
+-----+-----+------+