pyspark lead operation for creating time intervals


I have a dataframe like this:

time
First
Second
Third

I would like to get the following output using PySpark, where each row is paired with its subsequent row to form an interval:

time     start    end
First    First    Second
Second   Second   Third
Third    ...      ...

Do you have any suggestions?

CodePudding user response:

You first need to sort the dataframe by the time column, then create a monotonically_increasing_id so that the lead column can be computed over a window ordered by that id; the helper id is dropped at the end.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Window ordered by the helper id, which preserves the time ordering
w = Window.orderBy('id')

df = (
    df.sort('time')                                       # order rows by time
      .withColumn('id', F.monotonically_increasing_id())  # stable ordering key
      .withColumn('start', F.col('time'))                 # start = current row's time
      .withColumn('end', F.lead(F.col('time')).over(w))   # end = next row's time
      .drop('id')                                         # helper column no longer needed
)
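
For context, here is a minimal self-contained sketch of this approach, assuming the sample data from the question (the SparkSession setup and the dataframe name df are illustrative):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Single-column dataframe from the question
df = spark.createDataFrame([('First',), ('Second',), ('Third',)], ['time'])

w = Window.orderBy('id')

result = (
    df.sort('time')
      .withColumn('id', F.monotonically_increasing_id())
      .withColumn('start', F.col('time'))
      .withColumn('end', F.lead('time').over(w))
      .drop('id')
)

result.show()

# Expected result (the last row's end is null, since it has no subsequent row):
#   time    start   end
#   First   First   Second
#   Second  Second  Third
#   Third   Third   null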

CodePudding user response:

You can avoid creating extra columns (and the pain of dropping them) by using the last and lead window functions over a window ordered by time:

from pyspark.sql.functions import last, lead
from pyspark.sql.window import Window

w = Window.partitionBy().orderBy('time')
# last('time') over the default frame (ending at the current row) returns the
# current row's own time, so it doubles as the interval start
df = df.withColumn('Start', last('time').over(w)) \
       .withColumn('End', lead('time').over(w))

df.show()
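
Two details worth noting about this approach. First, Window.partitionBy().orderBy('time') defines a window with no partition columns, so Spark moves all rows to a single partition and logs a warning; that is fine for small data but costly at scale. Second, since last('time') over the default frame simply echoes the current row's time, an equivalent sketch skips last entirely and copies the column directly (column names here are illustrative):

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.orderBy('time')

# start is just the current row's time; end comes from the following row
df = df.withColumn('start', F.col('time')) \
       .withColumn('end', F.lead('time').over(w))

df.show()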