Spark: getting the first entry according to a date groupBy


Is it possible to get the first entry (the row with the earliest Datetime) of each day from a given dataframe?

Schema:

root
 |-- Datetime: timestamp (nullable = true)
 |-- Quantity: integer (nullable = true)


+-------------------+--------+
|           Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11|     200|
|2021-09-10 10:08:16|     100|
|2021-09-11 10:05:11|     100|
|2021-09-11 10:07:25|     100|
|2021-09-11 10:07:14|    3000|
|2021-09-12 09:24:11|    1000|
+-------------------+--------+

Desired output:

+-------------------+--------+
|           Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11|     200|
|2021-09-11 10:05:11|     100|
|2021-09-12 09:24:11|    1000|
+-------------------+--------+
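
For reproducibility, a minimal sketch (not part of the original question) that builds the sample dataframe above; it assumes a local SparkSession is available as spark:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Sample data matching the schema above; timestamps are parsed from strings
df = (
    spark.createDataFrame(
        [
            ("2021-09-10 10:08:11", 200),
            ("2021-09-10 10:08:16", 100),
            ("2021-09-11 10:05:11", 100),
            ("2021-09-11 10:07:25", 100),
            ("2021-09-11 10:07:14", 3000),
            ("2021-09-12 09:24:11", 1000),
        ],
        ["Datetime", "Quantity"],
    )
    .withColumn("Datetime", F.to_timestamp("Datetime"))
    .withColumn("Quantity", F.col("Quantity").cast("int"))
)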

CodePudding user response:

You can use row_number for that. Simply define a Window partitioned by day and ordered by Datetime:

from pyspark.sql import functions as F, Window

# Window over each calendar day, ordered by the timestamp
w = Window.partitionBy(F.to_date("Datetime")).orderBy("Datetime")

# Keep only the first row (earliest Datetime) within each day
df1 = df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")

df1.show()
#+-------------------+--------+
#|           Datetime|Quantity|
#+-------------------+--------+
#|2021-09-10 10:08:11|     200|
#|2021-09-11 10:05:11|     100|
#|2021-09-12 09:24:11|    1000|
#+-------------------+--------+
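
An alternative sketch (not from the original answer) that gets the same result without a window: group by the calendar day and take the minimum of a struct whose first field is Datetime, then unpack that struct. It assumes the same df as above:

from pyspark.sql import functions as F

# min() on a struct compares fields left to right, so putting Datetime
# first selects the row with the earliest timestamp within each day
df2 = (
    df.groupBy(F.to_date("Datetime").alias("day"))
      .agg(F.min(F.struct("Datetime", "Quantity")).alias("first"))
      .select("first.Datetime", "first.Quantity")
)

df2.orderBy("Datetime").show()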