How can I use the round function together with group by in PySpark? I have a Spark DataFrame from which I need to generate a result using group by and the round function.
data1 = [{'Name':'Jhon','ID':21.528,'Add':'USA','ID_2':'30.90'},
{'Name':'Joe','ID':3.69,'Add':'USA','ID_2':'12.80'},
{'Name':'Tina','ID':2.48,'Add':'IND','ID_2':'11.07'},
{'Name':'Jhon','ID':22.22, 'Add':'USA','ID_2':'34.87'},
{'Name':'Joe','ID':5.33,'Add':'INA','ID_2':'56.89'}]
a = sc.parallelize(data1)
In SQL, the query would be something like:
SELECT count(ID) AS newid, count(ID_2) AS secondaryid,
       round(([newid] + [secondaryid]) / [newid] * 200, 1) AS [NEW_PERCENTAGE]
FROM DATA1
GROUP BY Name
CodePudding user response:
You cannot use round inside a groupby; you need to create a new column afterwards:
import pyspark.sql.functions as F

df = spark.createDataFrame(a)
(df.groupby('Name')
   .agg(
       F.count('ID').alias('newid'),          # count of ID per Name
       F.count('ID_2').alias('secondaryid'),  # count of ID_2 per Name
   )
   .withColumn('NEW_PERCENTAGE',
               F.round(200 * (F.col('newid') + F.col('secondaryid')) / F.col('newid'), 1))
).show()
+----+-----+-----------+--------------+
|Name|newid|secondaryid|NEW_PERCENTAGE|
+----+-----+-----------+--------------+
| Joe|    2|          2|         400.0|
|Tina|    1|          1|         400.0|
|Jhon|    2|          2|         400.0|
+----+-----+-----------+--------------+
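For reference, roughly the same result can be produced with Spark SQL by registering the DataFrame as a temporary view. This is only a sketch, assuming the intended operator between the two counts is addition (which matches the 400.0 values above); note that plain SQL cannot reference the newid/secondaryid aliases inside the same SELECT, so the aggregate expressions are repeated inside round:

# Sketch of a Spark SQL equivalent (assumes the two counts are added, as in the result above)
df.createOrReplaceTempView('DATA1')
spark.sql("""
    SELECT Name,
           count(ID)   AS newid,
           count(ID_2) AS secondaryid,
           round((count(ID) + count(ID_2)) / count(ID) * 200, 1) AS NEW_PERCENTAGE
    FROM DATA1
    GROUP BY Name
""").show()

In the DataFrame API, computing NEW_PERCENTAGE with withColumn after the agg is usually the simpler option, because it can reuse the already aliased aggregate columns instead of repeating the count expressions.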