Home > Software engineering >  Pyspark dataframe: Create a new numeric column from a string column and calculate average
Pyspark dataframe: Create a new numeric column from a string column and calculate average

Time:06-04

I have a pyspark dataframe like the input data below. subject score column type is string. I want to first covert the string column type to integer column type The desired result is shown in Output 1.

I wish to calculate the average in a new numeric column avg_subject_score (do not want to typecast exiting column). The desired new column should be like shown in Output 2

Then, add a column (name "grade" StringType()) to this new dataframe (containing the grouped averages) that contains the string "Good" if the average score is between 50 and 99, "Very Good" if the average score is above 100, and "Fail" if average score is less than 50. The desired result is in Output 3

 ------ ------------------- 
|ID    |subject_score      |
 ------ ------------------- 
|123456|100                |
|123456|50                 |
|123456|0                  |
|789292|200                |
|789292|200                |
|789292|100                |
|239000|50                 |
|239000|100                |
|239000|NA                 |
|239000|NA                 |
 ------ ------------------- ```



Output 1 - without NA 
 ------ ------------------- 
|ID    |converted_score    |
 ------ ------------------- 
|123456|100                |
|123456|50                 |
|123456|0                  |
|789292|200                |
|789292|200                |
|789292|100                |
|239000|50                 |
|239000|100                |
 ------ ------------------- 

Output 2
 ------ ------------------- 
|ID    |avg_subject_score  |
 ------ ------------------- 
|123456|50                 |
|789292|167                |
|239000|38                 |
 ------ ------------------- 


Output 3
 ------ ------------------- ------------- 
|id    |avg_subject_score  |grade        |
 ------ ------------------- ------------- 
|123456|50                 |Good         |
|789292|167                |Very Good    |
|239000|38                 |Fail         |
 ------ ------------------- ------------- 

CodePudding user response:

This looks more like a homework as you should try to explore the basic functions on your own

But just for sake of it first lets create new column with required datatype

dfv=df2.withColumn("converted_score",col("subject_score").cast("long")).drop("subject_score").dropna(how="all",subset=["converted_score"])

now create group and avg

dfv=dfv.groupBy("ID").agg(avg("converted_score").alias("avg_subject_score"))

now you can use when to add last column

dfv=dfv.withColumn("grade",when(col("avg_subject_score")>100,"very good").when(col("avg_subject_score")>50,"good").otherwise("fail"))
  • Related