pyspark lead operation for creating time intervals


I have a dataframe like this:

time
First
Second
Third

I would like to get the following output using PySpark, where each row is paired with its subsequent row to form an interval:

time     start    end
First    First    Second
Second   Second   Third
Third    ...      ...

Do you have any suggestions?

CodePudding user response:

You first need to sort the dataframe by the time column, then create a monotonically_increasing_id so that the lead column can be computed over a window ordered by that id; the helper id is dropped at the end.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Window ordered by the helper id, which preserves the time ordering
w = Window.orderBy('id')

df = (
    df.sort('time')                                       # order rows by time
      .withColumn('id', F.monotonically_increasing_id())  # stable ordering key
      .withColumn('start', F.col('time'))                 # start = current row's time
      .withColumn('end', F.lead(F.col('time')).over(w))   # end = next row's time
      .drop('id')                                         # helper column no longer needed
)
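
For context, here is a minimal self-contained sketch of this approach, assuming the sample data from the question (the SparkSession setup and the dataframe name df are illustrative):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Single-column dataframe from the question
df = spark.createDataFrame([('First',), ('Second',), ('Third',)], ['time'])

w = Window.orderBy('id')

result = (
    df.sort('time')
      .withColumn('id', F.monotonically_increasing_id())
      .withColumn('start', F.col('time'))
      .withColumn('end', F.lead('time').over(w))
      .drop('id')
)

result.show()

# Expected result (the last row's end is null, since it has no subsequent row):
#   time    start   end
#   First   First   Second
#   Second  Second  Third
#   Third   Third   null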

CodePudding user response:

You can avoid creating extra columns (and the pain of dropping them) by using the last and lead window functions over a window ordered by time:

from pyspark.sql.functions import last, lead
from pyspark.sql.window import Window

w = Window.partitionBy().orderBy('time')
# last('time') over the default frame (ending at the current row) returns the
# current row's own time, so it doubles as the interval start
df = df.withColumn('Start', last('time').over(w)) \
       .withColumn('End', lead('time').over(w))

df.show()
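
Two details worth noting about this approach. First, Window.partitionBy().orderBy('time') defines a window with no partition columns, so Spark moves all rows to a single partition and logs a warning; that is fine for small data but costly at scale. Second, since last('time') over the default frame simply echoes the current row's time, an equivalent sketch skips last entirely and copies the column directly (column names here are illustrative):

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.orderBy('time')

# start is just the current row's time; end comes from the following row
df = df.withColumn('start', F.col('time')) \
       .withColumn('end', F.lead('time').over(w))

df.show()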