Create Column with Lap Times Given Total Time Elapsed-CodePudding

I have four different people, each run laps and I need to calculate lap time for each lap for each person. I am given the total elapsed time starting from the very beginning at the end of each lap. What kind of PySpark/SQL/Pandas syntax could I use to calculate lap times efficiently?

Example:

Each row represents one person running one lap.

PersonID	Total Time Elapsed (sec)	Lap Time (sec)
1	200	200
1	300	100
1	550	250
2	100	100
2	150	50
2	250	100
3	150	150
3	500	350
4	100	100
4	300	200
4	350	50
4	460	110

I need to create the 3rd column, Lap Time, using only the first two columns. I could write a for loop that would eventually work but what is the best/efficient PySpark/SQL/Pandas way to get there?

CodePudding user response：

If you are looking for a pandas solution, one of the approaches could be:

import numpy as np
import pandas as pd
data=[[1,200],[1,300],[1,550],[2,100],[2,150],[2,250],[3,150],[3,500],[4,100],[4,300],[4,350],[4,460]]
df = pd.DataFrame(
data,columns=['PersonID','Total Time Elapsed (sec)']
)
print(df.head(50))
# Peform groupby on PersonID
grouped_df=df.groupby(by='PersonID')
series_list=[]

for name,grp in grouped_df:
  result=grp['Total Time Elapsed (sec)'].diff() # Compute difference between current and prev row
  result=result.fillna(grp['Total Time Elapsed (sec)']) # Fill na with same value in Elapsed col(same row).
  series_list.extend(result.values)

df['Lap Time (sec)']=series_list
df['Lap Time (sec)']=df['Lap Time (sec)'].astype(int) # Changing the datatype
print(df.head(50))

CodePudding user response：

Window functions could do it.

PySpark:

from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [(1, 200), (1, 300), (1, 550), (2, 100), (2, 150), (2, 250),
     (3, 150), (3, 500), (4, 100), (4, 300), (4, 350), (4, 460)],
    ['PersonID', 'Total_Time_Elapsed'])

w = W.partitionBy('PersonID').orderBy('Total_Time_Elapsed')
df = df.withColumn(
    'Lap_Time',
    F.coalesce(
        F.col('Total_Time_Elapsed') - F.lag('Total_Time_Elapsed').over(w),
        'Total_Time_Elapsed'))
df.show()
#  -------- ------------------ -------- 
# |PersonID|Total_Time_Elapsed|Lap_Time|
#  -------- ------------------ -------- 
# |       1|               200|     200|
# |       1|               300|     100|
# |       1|               550|     250|
# |       2|               100|     100|
# |       2|               150|      50|
# |       2|               250|     100|
# |       3|               150|     150|
# |       3|               500|     350|
# |       4|               100|     100|
# |       4|               300|     200|
# |       4|               350|      50|
# |       4|               460|     110|
#  -------- ------------------ --------

SQL:

SELECT
    PersonID,
    Total_Time_Elapsed,
    COALESCE(
        Total_Time_Elapsed - LAG(Total_Time_Elapsed) OVER (PARTITION BY PersonID ORDER BY Total_Time_Elapsed),
        Total_Time_Elapsed) Lap_Time
FROM df

PersonID	Total Time Elapsed (sec)	Lap Time (sec)
1	200	200
1	300	100
1	550	250
2	100	100
2	150	50
2	250	100
3	150	150
3	500	350
4	100	100
4	300	200
4	350	50
4	460	110

PersonID	Total Time Elapsed (sec)	Lap Time (sec)
1	200	200
1	300	100
1	550	250
2	100	100
2	150	50
2	250	100
3	150	150
3	500	350
4	100	100
4	300	200
4	350	50
4	460	110

PersonID	Total Time Elapsed (sec)	Lap Time (sec)
1	200	200
1	300	100
1	550	250
2	100	100
2	150	50
2	250	100
3	150	150
3	500	350
4	100	100
4	300	200
4	350	50
4	460	110