Is it possible to get the first Datetime
of each day from a certain dataframe?
Schema:
root
|-- Datetime: timestamp (nullable = true)
|-- Quantity: integer (nullable = true)
+-------------------+--------+
|           Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11|     200|
|2021-09-10 10:08:16|     100|
|2021-09-11 10:05:11|     100|
|2021-09-11 10:07:25|     100|
|2021-09-11 10:07:14|    3000|
|2021-09-12 09:24:11|    1000|
+-------------------+--------+
Desired output:
+-------------------+--------+
|           Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11|     200|
|2021-09-11 10:05:11|     100|
|2021-09-12 09:24:11|    1000|
+-------------------+--------+
CodePudding user response:
You can use row_number for that. Simply define a Window partitioned by day and ordered by Datetime:
from pyspark.sql import functions as F, Window
w = Window.partitionBy(F.to_date("Datetime")).orderBy("Datetime")
df1 = df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
df1.show()
# +-------------------+--------+
# |           Datetime|Quantity|
# +-------------------+--------+
# |2021-09-10 10:08:11|     200|
# |2021-09-11 10:05:11|     100|
# |2021-09-12 09:24:11|    1000|
# +-------------------+--------+