I have a function like this:
def number(row):
    if row['temp'] == '1 Person':
        return 'One'
    elif row['temp'] == '2 Persons':
        return 'Two'
    elif row['temp'] == '3 Persons':
        return 'Three'
    elif row['temp'] in ['4 Persons', '5 Persons', '6 Persons', '7 Persons', '8 Persons',
                         '9 Persons', '10 Persons', '11 Persons']:
        return 'More'
    else:
        return None
Now, I want to change the values in my data frame by looping through it row-wise.
How can I loop through my data frame and replace the values according to the function above in PySpark?
CodePudding user response:
Create a sample data frame with the data:
df_row = spark.createDataFrame(
    [("1 Person", "2", "3"), ("2 Persons", "2", "3"), ("3 Persons", "2", "3"),
     ("4 Persons", "2", "3"), ("5 Persons", "2", "3"), ("6 Persons", "2", "3"),
     ("7 Persons", "2", "3"), ("8 Persons", "2", "3"), ("9 Persons", "2", "3")],
    schema=['temp', 'col2', 'col3']
)
Define the function and create the UDF from it (note the imports for udf, col, and StringType):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def number(row):
    if row == '1 Person':
        return 'One'
    elif row == '2 Persons':
        return 'Two'
    elif row == '3 Persons':
        return 'Three'
    elif row in ['4 Persons', '5 Persons', '6 Persons', '7 Persons', '8 Persons',
                 '9 Persons', '10 Persons', '11 Persons']:
        return 'More'
    else:
        return None

numberUDF = udf(lambda z: number(z), StringType())
Overwrite the 'temp' column with the UDF:
df_row = df_row.withColumn('temp', numberUDF(col('temp')))
Please check the output below:
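Running df_row.show() on the transformed frame should print something like this (the values follow directly from the sample data above):

+-----+----+----+
| temp|col2|col3|
+-----+----+----+
|  One|   2|   3|
|  Two|   2|   3|
|Three|   2|   3|
| More|   2|   3|
| More|   2|   3|
| More|   2|   3|
| More|   2|   3|
| More|   2|   3|
| More|   2|   3|
+-----+----+----+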
CodePudding user response:
The Python function can be translated almost one-to-one into a native SQL expression:
from pyspark.sql import functions as F
df.withColumn('result',
    F.expr("""case
        when temp == '1 Person' then 'One'
        when temp == '2 Persons' then 'Two'
        when temp == '3 Persons' then 'Three'
        when temp in ('4 Persons', '5 Persons', '6 Persons', '7 Persons',
                      '8 Persons', '9 Persons', '10 Persons', '11 Persons') then 'More'
        else null
    end""")).show()
While a UDF-based solution offers more flexibility, the SQL statement performs better, as Spark does not have to execute any Python code for it.
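For completeness, the same logic can also be written with the DataFrame API's when/otherwise column functions, which, like the SQL expression, runs entirely on the JVM. This is a sketch assuming the same df and temp column as above:

from pyspark.sql import functions as F

# Values that should all map to 'More'
more = ['4 Persons', '5 Persons', '6 Persons', '7 Persons',
        '8 Persons', '9 Persons', '10 Persons', '11 Persons']

df.withColumn('result',
    F.when(F.col('temp') == '1 Person', 'One')
     .when(F.col('temp') == '2 Persons', 'Two')
     .when(F.col('temp') == '3 Persons', 'Three')
     .when(F.col('temp').isin(more), 'More')
     .otherwise(None)   # rows matching no branch become null
).show()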