Use Group By and Aggregate Function in pyspark?


I am looking for a solution for how to use GROUP BY together with aggregate functions in PySpark. My DataFrame looks like this:

df = sc.parallelize([
    ('23-09-2020', 'CRICKET'),
    ('25-11-2020', 'CRICKET'),
    ('13-09-2021', 'FOOTBALL'),
    ('20-11-2021', 'BASKETBALL'),
    ('12-12-2021', 'FOOTBALL')]).toDF(['DATE', 'SPORTS_INTERESTED'])

I want to group by the SPORTS_INTERESTED column and select the MIN of the DATE column. Below is the query I am using:

import pyspark.sql.functions as F

df = (df.groupby('SPORTS_INTERESTED')
        .agg(F.count('SPORTS_INTERESTED').alias('FIRST_COUNT'),
             F.min('DATE').alias('MIN_OF_DATE_COLUMN'))
        .filter(F.col('FIRST_COUNT') > 1))

But when I run the query above, it returns the MAX date rather than the MIN date. Desired output:

## +-----------------+-------------------+
## |SPORTS_INTERESTED| MIN_OF_DATE_COLUMN|
## +-----------------+-------------------+
## |  CRICKET        |23-09-2020         |
## | FOOTBALL        |13-09-2021         |
## +-----------------+-------------------+

Output I am getting:

## +-----------------+-------------------+
## |SPORTS_INTERESTED| MIN_OF_DATE_COLUMN|
## +-----------------+-------------------+
## |  CRICKET        |25-11-2020         |
## | FOOTBALL        |12-12-2021         |
## +-----------------+-------------------+

Both columns are of string datatype.
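Since DATE is a string, `min` compares the values lexicographically, character by character, so in the dd-MM-yyyy format the day field dominates and the year is effectively ignored. A minimal plain-Python sketch (no Spark needed, dates taken from the question) shows the pitfall:

```python
# dd-MM-yyyy strings compare character by character: '12...' < '13...',
# even though 12-12-2021 is the LATER calendar date.
dates = ['13-09-2021', '12-12-2021']  # Sep 13 vs Dec 12, 2021
print(min(dates))  # -> '12-12-2021': lexicographic minimum, not the earliest date
```

The same comparison rule is what Spark's `min` applies to a string column, which is why the results above look wrong.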

CodePudding user response:

Your DATE column is a string, so `min` picks the lexicographically smallest value, not the earliest date. First convert the string to a date type, then apply `min`:

import pyspark.sql.functions as F

df = spark.createDataFrame(data=[
    ('23-09-2020', 'CRICKET'),
    ('25-11-2020', 'CRICKET'),
    ('13-09-2021', 'FOOTBALL'),
    ('20-11-2021', 'BASKETBALL'),
    ('12-12-2021', 'FOOTBALL')    
], schema=['DATE', 'SPORTS_INTERESTED'])

df = df.withColumn("DATE", F.to_date("DATE", format="dd-MM-yyyy"))
df = df.groupBy("SPORTS_INTERESTED").agg(F.min("DATE").alias("MIN_OF_DATE"))

[Out]:
+-----------------+-----------+
|SPORTS_INTERESTED|MIN_OF_DATE|
+-----------------+-----------+
|BASKETBALL       |2021-11-20 |
|FOOTBALL         |2021-09-13 |
|CRICKET          |2020-09-23 |
+-----------------+-----------+