I was trying to find out the time window in which my dataframe's values lie, but I am not able to understand the output, or how window(timeColumn, windowDuration, slideDuration=None, startTime=None) works.
This is the code:
from pyspark.sql.functions import window

df = spark.createDataFrame(
    [("0000-01-01 00:00:00", 1), ("1970-01-01 19:02:34", 1), ("1970-01-01 19:01:29", 1)]
).toDF("date", "val")
w = df.groupBy(window("date", windowDuration="55 seconds")).sum("val").alias("sum")
display(w)
I am using Databricks.
Can someone please show me the output and explain how it works?
CodePudding user response:
From a timestamp column, window will create a "bucket" (start and end time) that contains the input timestamp. The size of the bucket depends on the windowDuration parameter.
For example, if you have the timestamp 2021-10-29 11:13:51 and you apply the window function with windowDuration = "15 minutes", the new column will be a struct with start = "2021-10-29 11:00:00" and end = "2021-10-29 11:15:00". Between start and end there are 15 minutes, and your timestamp falls in between.
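Here is a minimal sketch of that example (assuming a Spark session named spark, as in your snippet; the explicit cast is there because window expects a TimestampType column):

from pyspark.sql.functions import window, col

demo = spark.createDataFrame([("2021-10-29 11:13:51",)], ["date"])
demo.select(window(col("date").cast("timestamp"), "15 minutes")).show(truncate=False)
# window struct: start = 2021-10-29 11:00:00, end = 2021-10-29 11:15:00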
In your current code, you use windowDuration="55 seconds". According to the documentation:
The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals.
It means that, starting from 1970-01-01 00:00:00, it will create windows of 55 seconds in length. The first one will be 1970-01-01 00:00:00 ==> 1970-01-01 00:00:55, the second one 1970-01-01 00:00:55 ==> 1970-01-01 00:01:50, etc.
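You can reproduce the bucket assignment by hand: floor the epoch seconds down to a multiple of 55. The bucket helper below is hypothetical, just to illustrate the arithmetic (UTC assumed; it skips your 0000-01-01 row, which Python's datetime cannot represent):

from datetime import datetime, timezone

def bucket(ts_str, duration=55):
    # Epoch seconds of the input timestamp (UTC assumed)
    secs = int(datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S")
               .replace(tzinfo=timezone.utc).timestamp())
    start = secs - secs % duration  # floor to a multiple of the duration
    return (datetime.fromtimestamp(start, timezone.utc),
            datetime.fromtimestamp(start + duration, timezone.utc))

print(bucket("1970-01-01 19:01:29"))  # 1970-01-01 19:01:15 ==> 19:02:10
print(bucket("1970-01-01 19:02:34"))  # 1970-01-01 19:02:10 ==> 19:03:05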
It works, but the start and end times are not as regular as they would be with, for example, a 1 minute duration.
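Reusing the hypothetical bucket helper above, a 1 minute duration lands on round boundaries:

print(bucket("1970-01-01 19:01:29", duration=60))  # 1970-01-01 19:01:00 ==> 19:02:00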