Home > other >  Pyspark Dataframe substract rows from the same column?
Pyspark Dataframe substract rows from the same column?

Time:07-08

I have the following example as a pyspark dataframe:

Timeframe Person Activity
2022-06-21 8:00:00 Lisa Working
2022-06-21 8:30:00 Joseph Homework
2022-06-21 8:00:00 Michael Gardening
2022-06-21 9:00:00 Joseph Rowing
2022-06-21 9:00:00 Lisa Working
2022-06-21 9:15:00 Joseph Football
2022-06-21 10:00:00 Joseph Dancing
2022-06-21 10:00:00 Lisa Watering
2022-06-21 10:30:00 Joseph Gaming

I would like to calculate how long each activity for each person lasted. For example create a new column like this:

Timeframe Person Activity Duration
2022-06-21 8:00:00 Lisa Working 01:00:00
2022-06-21 8:30:00 Joseph Homework 00:30:00
2022-06-21 8:00:00 Michael Gardening 01:15:00
2022-06-21 9:00:00 Joseph Rowing 01:00:00
2022-06-21 9:00:00 Lisa Working 01:00:00
2022-06-21 9:15:00 Michael Football 01:45:00
2022-06-21 10:00:00 Joseph Dancing N/A
2022-06-21 10:00:00 Lisa Watering N/A
2022-06-21 10:30:00 Michael Gaming N/A

I need to substract the Timeframe row for each person separatly and create a new column. There is no pause in between. How can it be done in Pyspark or alternatively in Pandas?

Thanks!

CodePudding user response:

We can calculate the time difference in seconds and convert it to the required format.

Using a subset of your data for example.

data_ls = [
    ('2022-06-21 8:00:00', 'Lisa', 'Working'),
    ('2022-06-21 8:30:00', 'Joe', 'HW'),
    ('2022-06-21 8:00:00', 'Mike', 'Gardening'),
    ('2022-06-21 9:00:00', 'Joe', 'Rowing'),
    ('2022-06-21 9:00:00', 'Lisa', 'Working')
]

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['ts', 'name', 'activity']). \
    withColumn('ts', func.col('ts').cast('timestamp'))

#  ------------------- ---- --------- 
# |                 ts|name| activity|
#  ------------------- ---- --------- 
# |2022-06-21 08:00:00|Lisa|  Working|
# |2022-06-21 08:30:00| Joe|       HW|
# |2022-06-21 08:00:00|Mike|Gardening|
# |2022-06-21 09:00:00| Joe|   Rowing|
# |2022-06-21 09:00:00|Lisa|  Working|
#  ------------------- ---- --------- 

We can take a lead() (next timestamp) for each name and subtract it from current timestamp to get the duration in seconds. Using the seconds, we can calculate minutes, hours, or even format it as a time string.

data_sdf. \
    withColumn('duration_sec', 
               func.coalesce(func.lead('ts').over(wd.partitionBy('name').orderBy('ts')).cast('long') - func.col('ts').cast('long'), 
                             func.lit(0)
                             )
               ). \
    withColumn('duration_min', func.col('duration_sec') / 60). \
    withColumn('duration_time', func.from_unixtime('duration_sec', format='HH:mm:ss')). \
    show()

#  ------------------- ---- --------- ------------ ------------ ------------- 
# |                 ts|name| activity|duration_sec|duration_min|duration_time|
#  ------------------- ---- --------- ------------ ------------ ------------- 
# |2022-06-21 08:30:00| Joe|       HW|        1800|        30.0|     00:30:00|
# |2022-06-21 09:00:00| Joe|   Rowing|           0|         0.0|     00:00:00|
# |2022-06-21 08:00:00|Mike|Gardening|           0|         0.0|     00:00:00|
# |2022-06-21 08:00:00|Lisa|  Working|        3600|        60.0|     01:00:00|
# |2022-06-21 09:00:00|Lisa|  Working|           0|         0.0|     00:00:00|
#  ------------------- ---- --------- ------------ ------------ ------------- 
  • Related