I have a dataframe like this:
| time |
| --- |
| First |
| Second |
| Third |
I would like to produce an output like this using PySpark, where each row is paired with its subsequent row to form an interval:
| time | start | end |
| --- | --- | --- |
| First | First | Second |
| Second | Second | Third |
| Third | ... | ... |
Do you have any suggestions?
CodePudding user response:
You first need to sort the dataframe by the `time` column, then create an `id` column with `monotonically_increasing_id` so you can compute a `lead` column over a window ordered by that id:
```python
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Window ordered by the generated id, so lead() picks up the next row
w = Window.orderBy('id')

df \
    .sort('time') \
    .withColumn('id', F.monotonically_increasing_id()) \
    .withColumn('start', F.col('time')) \
    .withColumn('end', F.lead(F.col('time')).over(w)) \
    .drop('id')
```
CodePudding user response:
You can avoid creating extra columns, and the pain of dropping them afterwards, by using the `last` and `lead` window functions directly:
```python
from pyspark.sql.functions import last, lead
from pyspark.sql.window import Window

w = Window.partitionBy().orderBy('time')
# last() over the default running frame returns the current row; lead() returns the next
df = df.withColumn('Start', last('time').over(w)).withColumn('End', lead('time').over(w))
df.show()
```