I have the following example as a pyspark dataframe:
Timeframe | Person | Activity |
---|---|---|
2022-06-21 8:00:00 | Lisa | Working |
2022-06-21 8:30:00 | Joseph | Homework |
2022-06-21 8:00:00 | Michael | Gardening |
2022-06-21 9:00:00 | Joseph | Rowing |
2022-06-21 9:00:00 | Lisa | Working |
2022-06-21 9:15:00 | Joseph | Football |
2022-06-21 10:00:00 | Joseph | Dancing |
2022-06-21 10:00:00 | Lisa | Watering |
2022-06-21 10:30:00 | Joseph | Gaming |
I would like to calculate how long each activity for each person lasted. For example create a new column like this:
Timeframe | Person | Activity | Duration |
---|---|---|---|
2022-06-21 8:00:00 | Lisa | Working | 01:00:00 |
2022-06-21 8:30:00 | Joseph | Homework | 00:30:00 |
2022-06-21 8:00:00 | Michael | Gardening | 01:15:00 |
2022-06-21 9:00:00 | Joseph | Rowing | 01:00:00 |
2022-06-21 9:00:00 | Lisa | Working | 01:00:00 |
2022-06-21 9:15:00 | Michael | Football | 01:45:00 |
2022-06-21 10:00:00 | Joseph | Dancing | N/A |
2022-06-21 10:00:00 | Lisa | Watering | N/A |
2022-06-21 10:30:00 | Michael | Gaming | N/A |
I need to substract the Timeframe row for each person separatly and create a new column. There is no pause in between. How can it be done in Pyspark or alternatively in Pandas?
Thanks!
CodePudding user response:
We can calculate the time difference in seconds and convert it to the required format.
Using a subset of your data for example.
data_ls = [
('2022-06-21 8:00:00', 'Lisa', 'Working'),
('2022-06-21 8:30:00', 'Joe', 'HW'),
('2022-06-21 8:00:00', 'Mike', 'Gardening'),
('2022-06-21 9:00:00', 'Joe', 'Rowing'),
('2022-06-21 9:00:00', 'Lisa', 'Working')
]
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['ts', 'name', 'activity']). \
withColumn('ts', func.col('ts').cast('timestamp'))
# ------------------- ---- ---------
# | ts|name| activity|
# ------------------- ---- ---------
# |2022-06-21 08:00:00|Lisa| Working|
# |2022-06-21 08:30:00| Joe| HW|
# |2022-06-21 08:00:00|Mike|Gardening|
# |2022-06-21 09:00:00| Joe| Rowing|
# |2022-06-21 09:00:00|Lisa| Working|
# ------------------- ---- ---------
We can take a lead()
(next timestamp) for each name
and subtract it from current timestamp to get the duration in seconds. Using the seconds, we can calculate minutes, hours, or even format it as a time string.
data_sdf. \
withColumn('duration_sec',
func.coalesce(func.lead('ts').over(wd.partitionBy('name').orderBy('ts')).cast('long') - func.col('ts').cast('long'),
func.lit(0)
)
). \
withColumn('duration_min', func.col('duration_sec') / 60). \
withColumn('duration_time', func.from_unixtime('duration_sec', format='HH:mm:ss')). \
show()
# ------------------- ---- --------- ------------ ------------ -------------
# | ts|name| activity|duration_sec|duration_min|duration_time|
# ------------------- ---- --------- ------------ ------------ -------------
# |2022-06-21 08:30:00| Joe| HW| 1800| 30.0| 00:30:00|
# |2022-06-21 09:00:00| Joe| Rowing| 0| 0.0| 00:00:00|
# |2022-06-21 08:00:00|Mike|Gardening| 0| 0.0| 00:00:00|
# |2022-06-21 08:00:00|Lisa| Working| 3600| 60.0| 01:00:00|
# |2022-06-21 09:00:00|Lisa| Working| 0| 0.0| 00:00:00|
# ------------------- ---- --------- ------------ ------------ -------------