I have a function like this:
def number(row):
    if row['temp'] == '1 Person':
        return 'One'
    elif row['temp'] == '2 Persons':
        return 'Two'
    elif row['temp'] == '3 Persons':
        return 'Three'
    elif row['temp'] in ['4 Persons', '5 Persons', '6 Persons', '7 Persons', '8 Persons',
                         '9 Persons', '10 Persons', '11 Persons']:
        return 'More'
    else:
        return None
Now, I want to change the values in my data frame by looping through it row-wise.
How can I loop through my data frame and replace the values according to the function above in PySpark?
CodePudding user response:
Create a sample data frame with the data:
df_row = spark.createDataFrame(
    [("1 Person", "2", "3"), ("2 Persons", "2", "3"), ("3 Persons", "2", "3"),
     ("4 Persons", "2", "3"), ("5 Persons", "2", "3"), ("6 Persons", "2", "3"),
     ("7 Persons", "2", "3"), ("8 Persons", "2", "3"), ("9 Persons", "2", "3")],
    schema=['temp', 'col2', 'col3']
)
Define the function and create the UDF from it (note the imports for udf, col, and StringType):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def number(row):
    if row == '1 Person':
        return 'One'
    elif row == '2 Persons':
        return 'Two'
    elif row == '3 Persons':
        return 'Three'
    elif row in ['4 Persons', '5 Persons', '6 Persons', '7 Persons', '8 Persons',
                 '9 Persons', '10 Persons', '11 Persons']:
        return 'More'
    else:
        return None

numberUDF = udf(lambda z: number(z), StringType())
Overwrite the 'temp' column with the UDF:
df_row = df_row.withColumn('temp', numberUDF(col('temp')))
Please check the output below:
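Running df_row.show() on the transformed frame should print something like this (the values follow directly from the sample data above):

+-----+----+----+
| temp|col2|col3|
+-----+----+----+
|  One|   2|   3|
|  Two|   2|   3|
|Three|   2|   3|
| More|   2|   3|
| More|   2|   3|
| More|   2|   3|
| More|   2|   3|
| More|   2|   3|
| More|   2|   3|
+-----+----+----+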
CodePudding user response:
The Python function can be translated almost one-to-one into a native SQL expression:
from pyspark.sql import functions as F
df.withColumn('result',
    F.expr("""case
        when temp == '1 Person' then 'One'
        when temp == '2 Persons' then 'Two'
        when temp == '3 Persons' then 'Three'
        when temp in ('4 Persons', '5 Persons', '6 Persons', '7 Persons',
                      '8 Persons', '9 Persons', '10 Persons', '11 Persons') then 'More'
        else null
    end""")).show()
While a UDF-based solution offers more flexibility, the SQL statement performs better, as Spark does not have to execute any Python code for it.
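For completeness, the same logic can also be written with the DataFrame API's when/otherwise column functions, which, like the SQL expression, runs entirely on the JVM. This is a sketch assuming the same df and temp column as above:

from pyspark.sql import functions as F

# Values that should all map to 'More'
more = ['4 Persons', '5 Persons', '6 Persons', '7 Persons',
        '8 Persons', '9 Persons', '10 Persons', '11 Persons']

df.withColumn('result',
    F.when(F.col('temp') == '1 Person', 'One')
     .when(F.col('temp') == '2 Persons', 'Two')
     .when(F.col('temp') == '3 Persons', 'Three')
     .when(F.col('temp').isin(more), 'More')
     .otherwise(None)   # rows matching no branch become null
).show()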