In Python (pandas), you can filter rows and assign a value to a new column with df.loc[df["A"].isin([1, 2, 3]), "newColumn"] = "numberType". How does this work in PySpark?
CodePudding user response:
FYI, Python itself has no built-in DataFrame. The code you showed is pandas syntax; pandas is a Python library for data analysis and manipulation.
For your problem, you can use when, lit, and col from pyspark.sql.functions to achieve this:
from pyspark.sql.functions import when, lit, col

df1 = df.withColumn(
    "newColumn",
    when(col("A").isin([1, 2, 3]), lit("numberType")).otherwise(lit("notNumberType")),
)
df1.show(truncate=False)
CodePudding user response:
Use the when function to set the value conditionally, and the isin column method to check membership in the list:
import pandas as pd

pdf = pd.DataFrame(data=[[1,""],[2,""],[3,""],[4,""],[5,""]], columns=["A", "newColumn"])
pdf.loc[pdf["A"].isin([1,2,3]), "newColumn"] = "numberType"
print(pdf)
A newColumn
0 1 numberType
1 2 numberType
2 3 numberType
3 4
4 5
import pyspark.sql.functions as F
sdf = spark.createDataFrame(data=[[1,""],[2,""],[3,""],[4,""],[5,""]], schema=["A", "newColumn"])
sdf = sdf.withColumn("newColumn", F.when(F.col("A").isin([1,2,3]), F.lit("numberType")))
sdf.show()
+---+----------+
|  A| newColumn|
+---+----------+
|  1|numberType|
|  2|numberType|
|  3|numberType|
|  4|      null|
|  5|      null|
+---+----------+