Home > Back-end >  How to Perform GroupBy , Having and Order by together in Pyspark
How to Perform GroupBy , Having and Order by together in Pyspark

Time:11-17

I am looking for a solution where i am performing GROUP BY, HAVING CLAUSE and ORDER BY Together in a Pyspark Code. Basically we need to shift some data from one dataframe to another with some conditions.

The SQL Query looks like this which i am trying to change into Pyspark

SELECT TABLE1.NAME, Count(TABLE1.NAME) AS COUNTOFNAME, 
Count(TABLE1.ATTENDANCE) AS COUNTOFATTENDANCE INTO SCHOOL_DATA_TABLE
FROM TABLE1
WHERE (((TABLE1.NAME) Is Not Null))
GROUP BY TABLE1.NAME
HAVING (((Count(TABLE1.NAME))>1) AND ((Count(TABLE1.ATTENDANCE))<>5))
ORDER BY Count(TABLE1.NAME) DESC;

The Spark Code which i have tried and failing is: What i have done the i have taken columns from df to df2 on which operations need to be done:

df2= df.select('NAME','ATTENDANCE')
    df2=df2.groupBy('NAME').agg(count('NAME').alias('name1').agg(count('ATTENDANCE').alias('NEW_ATTENDANCE'))).filter((col('name1')>1) & (col('NEW_ATTENDANCE') !=5))

SAMPLE DATA

rdd = spark.sparkContext.parallelize([
    ('Aayush', 10),
    ('Aayush', 9),
    ('Shiva', 5 ),
    ('Alia', 6),
    ('Aayan', 11),
    ('Alia',9)])
df_1 = spark.createDataFrame(rdd, schema=['NAME','ATTENDANCE'])

CodePudding user response:

Your code is almost ok, after fixing a few syntax issues it works. Also, I think for "attendance" you want to use sum rather than count (otherwise it will be always the same value as of name count). For sorting, simply add orderBy.

df.withColumn("NAME", lower("NAME"))
.groupBy('NAME')
.agg(count('NAME').alias('name1'),sum('ATTENDANCE').alias('NEW_ATTENDANCE'))
.filter((col('name1')>1) & (col('NEW_ATTENDANCE') !=5))
.orderBy(col("NAME"))
  • Related