I have the below PySpark dataframe:
a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']
df = pd.DataFrame(dict(dcode=a, zone=b))
dcode zone
0 480s North
1 480s West
2 499s East
3 499s North
4 650s East
5 650s North
6 702s North
7 702s West
8 736s North
9 736s South
10 736s West
11 737s North
12 737s West
I want my data frame to look like this -
dcode zone output
0 480s North NW
1 480s West NW
2 499s East
3 499s North NW
4 650s East
5 650s North NW
6 702s North
7 702s West
8 736s North
9 736s South
10 736s West
11 737s North
12 737s West
To achieve this I am using the logic below, but it's not giving the desired results.
df_ = df.withColumn("output", F.when((F.col("Zone") == "North") | (F.col("Zone") == "West") & (F.col("dcode") != "702s") | (F.col("dcode") != "736s") | (F.col("dcode") != "737s"), "NW"))
I want NW in the output column only when zone is North or West and dcode is not in 702s, 736s, 737s.
CodePudding user response:
You can directly use SQL-style expressions (the expr function).
import pyspark.sql.functions as F
......
df = df.withColumn('output', F.expr("case when zone in ('North', 'West') and dcode not in ('736s', '737s', '702s') then 'NW' end"))
......
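For reference, the same CASE WHEN rule can be sanity-checked on the original pandas frame before converting to Spark (a minimal sketch using numpy.where, not part of the Spark answer; column names follow the question):

```python
import numpy as np
import pandas as pd

# Data from the question
a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']
pdf = pd.DataFrame(dict(dcode=a, zone=b))

# zone in (North, West) AND dcode not in (736s, 737s, 702s)
mask = pdf['zone'].isin(['North', 'West']) & ~pdf['dcode'].isin(['736s', '737s', '702s'])
pdf['output'] = np.where(mask, 'NW', None)  # None plays the role of Spark's null
```

Only the 480s, 499s North, and 650s North rows end up with NW; every 702s/736s/737s row stays null, matching the desired output.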
CodePudding user response:
Check your parentheses - and note the dcode conditions must be joined with & rather than |, since a value can never be unequal to all three at once. By the way, df = pd.DataFrame(dict(dcode=a, zone=b)) is not PySpark.
from pyspark.sql import functions as F
import pandas as pd
a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']
df = pd.DataFrame(dict(dcode=a, zone=b))
df_ = spark.createDataFrame(df)
df_ = df_.withColumn(
    "output",
    F.when(
        ((F.col("zone") == "North") | (F.col("zone") == "West"))
        & (F.col("dcode") != "702s")
        & (F.col("dcode") != "736s")
        & (F.col("dcode") != "737s"),
        "NW",
    ),
)
df_.show()
+-----+-----+------+
|dcode| zone|output|
+-----+-----+------+
| 480s|North|    NW|
| 480s| West|    NW|
| 499s| East|  null|
| 499s|North|    NW|
| 650s| East|  null|
| 650s|North|    NW|
| 702s|North|  null|
| 702s| West|  null|
| 736s|North|  null|
| 736s|South|  null|
| 736s| West|  null|
| 737s|North|  null|
| 737s| West|  null|
+-----+-----+------+
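The two pitfalls in the original expression can be seen with plain Python booleans (a quick sketch, nothing Spark-specific):

```python
# Pitfall 1: & binds tighter than |, so a | b & c parses as a | (b & c).
assert (True | False & False) is True      # parsed as True | (False & False)
assert ((True | False) & False) is False   # explicit parentheses change the result

# Pitfall 2: OR-ing inequalities against different values is always True.
dcode = '702s'
assert ((dcode != '702s') | (dcode != '736s')) is True   # every dcode passes this
# AND is what the filter actually needs:
assert ((dcode != '702s') & (dcode != '736s')) is False  # 702s is now excluded
```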
CodePudding user response:
Please consider first converting your pandas df to a Spark one, since you are using PySpark syntax. Then I would advise rewriting your code in a more concise and clearer way, using isin:
from pyspark.sql import functions as F
df = spark.createDataFrame(df)
df_ = df.withColumn(
    "output",
    F.when(
        F.col("zone").isin("North", "West")
        & ~F.col("dcode").isin("736s", "737s", "702s"),
        "NW",
    ).otherwise(""),
)
>>> df_.show(truncate=False)
+-----+-----+------+
|dcode|zone |output|
+-----+-----+------+
|480s |North|NW |
|480s |West |NW |
|499s |East | |
|499s |North|NW |
|650s |East | |
|650s |North|NW |
|702s |North| |
|702s |West | |
|736s |North| |
|736s |South| |
|736s |West | |
|737s |North| |
|737s |West | |
+-----+-----+------+