Replace column values in a PySpark dataframe based on multiple conditions


I have the below PySpark dataframe:

import pandas as pd

a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']

df = pd.DataFrame(dict(dcode=a, zone=b))



dcode   zone
0   480s  North
1   480s   West
2   499s   East
3   499s  North
4   650s   East
5   650s  North
6   702s  North
7   702s   West
8   736s  North
9   736s  South
10  736s   West
11  737s  North
12  737s   West

I want my data frame to look like this:

 dcode   zone output
0   480s  North     NW
1   480s   West     NW
2   499s   East       
3   499s  North     NW
4   650s   East       
5   650s  North     NW
6   702s  North       
7   702s   West       
8   736s  North       
9   736s  South       
10  736s   West       
11  737s  North       
12  737s   West  

For this, I am using the logic below, but it's not giving the desired results:

df_ = df.withColumn(
    "output",
    F.when(
        (F.col("Zone") == "North")
        | (F.col("Zone") == "West") & (F.col("dcode") != "702s")
        | (F.col("dcode") != "736s")
        | (F.col("dcode") != "737s"),
        "NW",
    )
)

I want NW in the output column only when zone is North or West and dcode is not in 736s, 737s, 702s.

CodePudding user response:

You can directly use a SQL-style expression (the expr function):

import pyspark.sql.functions as F

df = df.withColumn('output', F.expr("case when zone in ('North', 'West') and dcode not in ('736s', '737s', '702s') then 'NW' end"))
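Note that a case expression without an else branch yields null for the non-matching rows. If you prefer empty strings, as in the desired output above, add an else clause (same expression, minimal sketch):

df = df.withColumn('output', F.expr("case when zone in ('North', 'West') and dcode not in ('736s', '737s', '702s') then 'NW' else '' end"))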

CodePudding user response:

Just check your parentheses: the two zone tests must be grouped together, and the dcode exclusions must be combined with & (and), not | (or), since a condition like (dcode != '702s') | (dcode != '736s') is true for every row.

By the way, df = pd.DataFrame(dict(dcode=a, zone=b)) is not PySpark

from pyspark.sql import functions as F
import pandas as pd


a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']


df = pd.DataFrame(dict(dcode=a, zone=b))

df_ = spark.createDataFrame(df)

df_ = df_.withColumn(
    "output",
    F.when(
        ((F.col("zone") == "North") | (F.col("zone") == "West"))
        & (F.col("dcode") != "702s")
        & (F.col("dcode") != "736s")
        & (F.col("dcode") != "737s"),
        "NW",
    )
)

df_.show()

 ----- ----- ------ 
|dcode| zone|output|
 ----- ----- ------ 
| 480s|North|    NW|
| 480s| West|    NW|
| 499s| East|  null|
| 499s|North|    NW|
| 650s| East|  null|
| 650s|North|    NW|
| 702s|North|  null|
| 702s| West|  null|
| 736s|North|  null|
| 736s|South|  null|
| 736s| West|  null|
| 737s|North|  null|
| 737s| West|  null|
 ----- ----- ------ 
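If you want to avoid pandas altogether, you can build the same DataFrame natively (a minimal sketch, assuming an active SparkSession named spark):

# build the two-column DataFrame directly from the Python lists
df_ = spark.createDataFrame(list(zip(a, b)), ["dcode", "zone"])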

CodePudding user response:

Please consider first converting your pandas df to a Spark one, since you are using PySpark syntax. Then I would advise rewriting your code in a more concise and clearer way, using isin:

from pyspark.sql import functions as F
df = spark.createDataFrame(df)

df_ = df.withColumn(
    "output",
    F.when(
        F.col("zone").isin("North", "West") & ~F.col("dcode").isin("736s", "737s", "702s"),
        "NW",
    ).otherwise("")
)

>>> df_.show(truncate=False)

 ----- ----- ------ 
|dcode|zone |output|
 ----- ----- ------ 
|480s |North|NW    |
|480s |West |NW    |
|499s |East |      |
|499s |North|NW    |
|650s |East |      |
|650s |North|NW    |
|702s |North|      |
|702s |West |      |
|736s |North|      |
|736s |South|      |
|736s |West |      |
|737s |North|      |
|737s |West |      |
 ----- ----- ------ 
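As an optional sanity check (my own addition, not part of the original answer), you can confirm that none of the excluded codes were tagged:

# expect 0: no excluded dcode should carry the NW label
df_.filter((F.col("output") == "NW") & F.col("dcode").isin("736s", "737s", "702s")).count()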