How to generate the max values for new columns in PySpark dataframe?


Suppose I have a PySpark dataframe df.

+---+---+
|  a|  b|
+---+---+
|  1|200|
|  2|300|
|  4| 50|
+---+---+

I'd like to add a new column c, where:

column c = max(0, column b - 100)

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|200|100|
|  2|300|200|
|  4| 50|  0|
+---+---+---+

How should I generate the new column c in a PySpark dataframe? Thanks in advance!

CodePudding user response:

Hope you are looking for something like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, greatest

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        (1, 200),
        (2, 300),
        (4, 50),
    ],
    ["a", "b"],
)

# greatest() returns the largest of its arguments per row,
# so this clamps b - 100 at a floor of 0.
df_new = df.withColumn("c", greatest(lit(0), col("b") - lit(100)))
df_new.show()
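
For comparison, the same clamp can also be written with when/otherwise instead of greatest(). This is a minimal alternative sketch, reusing the df defined above:

from pyspark.sql.functions import col, when

# Branch explicitly: return 0 whenever b - 100 would be negative,
# otherwise return b - 100.
df_alt = df.withColumn(
    "c", when(col("b") - 100 < 0, 0).otherwise(col("b") - 100)
)
df_alt.show()

Both versions run as native Column expressions, so neither requires a UDF.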
