In Python (pandas), you can filter rows and assign a value to a new column with df.loc[df["A"].isin([1, 2, 3]), "newColumn"] = "numberType". How does this work in PySpark?
CodePudding user response:
FYI, Python itself has no built-in DataFrame. The code you showed is pandas syntax; pandas is a Python library for data analysis and manipulation.
For your problem, you can use when, lit, and col from pyspark.sql.functions to achieve this:
from pyspark.sql.functions import when, lit, col

df1 = df.withColumn(
    "newColumn",
    when(col("A").isin([1, 2, 3]), lit("numberType")).otherwise(lit("notNumberType")),
)
df1.show(truncate=False)
CodePudding user response:
Use the when function to set the value conditionally, and the isin column method to check membership in the list:
import pandas as pd

pdf = pd.DataFrame(data=[[1,""],[2,""],[3,""],[4,""],[5,""]], columns=["A", "newColumn"])
pdf.loc[pdf["A"].isin([1,2,3]), "newColumn"] = "numberType"
print(pdf)
A newColumn
0 1 numberType
1 2 numberType
2 3 numberType
3 4
4 5
import pyspark.sql.functions as F
sdf = spark.createDataFrame(data=[[1,""],[2,""],[3,""],[4,""],[5,""]], schema=["A", "newColumn"])
sdf = sdf.withColumn("newColumn", F.when(F.col("A").isin([1,2,3]), F.lit("numberType")))
sdf.show()
+---+----------+
|  A| newColumn|
+---+----------+
|  1|numberType|
|  2|numberType|
|  3|numberType|
|  4|      null|
|  5|      null|
+---+----------+