I am looking for a solution where i am performing GROUP BY, HAVING CLAUSE and ORDER BY Together in a Pyspark Code. Basically we need to shift some data from one dataframe to another with some conditions.
The SQL Query looks like this which i am trying to change into Pyspark
SELECT TABLE1.NAME, Count(TABLE1.NAME) AS COUNTOFNAME,
Count(TABLE1.ATTENDANCE) AS COUNTOFATTENDANCE INTO SCHOOL_DATA_TABLE
FROM TABLE1
WHERE (((TABLE1.NAME) Is Not Null))
GROUP BY TABLE1.NAME
HAVING (((Count(TABLE1.NAME))>1) AND ((Count(TABLE1.ATTENDANCE))<>5))
ORDER BY Count(TABLE1.NAME) DESC;
The Spark Code which i have tried and failing is: What i have done the i have taken columns from df to df2 on which operations need to be done:
df2= df.select('NAME','ATTENDANCE')
df2=df2.groupBy('NAME').agg(count('NAME').alias('name1').agg(count('ATTENDANCE').alias('NEW_ATTENDANCE'))).filter((col('name1')>1) & (col('NEW_ATTENDANCE') !=5))
SAMPLE DATA
rdd = spark.sparkContext.parallelize([
('Aayush', 10),
('Aayush', 9),
('Shiva', 5 ),
('Alia', 6),
('Aayan', 11),
('Alia',9)])
df_1 = spark.createDataFrame(rdd, schema=['NAME','ATTENDANCE'])
CodePudding user response:
Your code is almost ok, after fixing a few syntax issues it works.
Also, I think for "attendance" you want to use sum
rather than count
(otherwise it will be always the same value as of name count).
For sorting, simply add orderBy
.
df.withColumn("NAME", lower("NAME"))
.groupBy('NAME')
.agg(count('NAME').alias('name1'),sum('ATTENDANCE').alias('NEW_ATTENDANCE'))
.filter((col('name1')>1) & (col('NEW_ATTENDANCE') !=5))
.orderBy(col("NAME"))