I have four different people, each run laps and I need to calculate lap time for each lap for each person. I am given the total elapsed time starting from the very beginning at the end of each lap. What kind of PySpark/SQL/Pandas syntax could I use to calculate lap times efficiently?
Example:
Each row represents one person running one lap.
PersonID | Total Time Elapsed (sec) | Lap Time (sec) |
---|---|---|
1 | 200 | 200 |
1 | 300 | 100 |
1 | 550 | 250 |
2 | 100 | 100 |
2 | 150 | 50 |
2 | 250 | 100 |
3 | 150 | 150 |
3 | 500 | 350 |
4 | 100 | 100 |
4 | 300 | 200 |
4 | 350 | 50 |
4 | 460 | 110 |
I need to create the 3rd column, Lap Time, using only the first two columns. I could write a for loop that would eventually work but what is the best/efficient PySpark/SQL/Pandas way to get there?
CodePudding user response:
If you are looking for a pandas solution, one of the approaches could be:
import numpy as np
import pandas as pd
data=[[1,200],[1,300],[1,550],[2,100],[2,150],[2,250],[3,150],[3,500],[4,100],[4,300],[4,350],[4,460]]
df = pd.DataFrame(
data,columns=['PersonID','Total Time Elapsed (sec)']
)
print(df.head(50))
# Peform groupby on PersonID
grouped_df=df.groupby(by='PersonID')
series_list=[]
for name,grp in grouped_df:
result=grp['Total Time Elapsed (sec)'].diff() # Compute difference between current and prev row
result=result.fillna(grp['Total Time Elapsed (sec)']) # Fill na with same value in Elapsed col(same row).
series_list.extend(result.values)
df['Lap Time (sec)']=series_list
df['Lap Time (sec)']=df['Lap Time (sec)'].astype(int) # Changing the datatype
print(df.head(50))
CodePudding user response:
Window functions could do it.
PySpark:
from pyspark.sql import functions as F, Window as W df = spark.createDataFrame( [(1, 200), (1, 300), (1, 550), (2, 100), (2, 150), (2, 250), (3, 150), (3, 500), (4, 100), (4, 300), (4, 350), (4, 460)], ['PersonID', 'Total_Time_Elapsed']) w = W.partitionBy('PersonID').orderBy('Total_Time_Elapsed') df = df.withColumn( 'Lap_Time', F.coalesce( F.col('Total_Time_Elapsed') - F.lag('Total_Time_Elapsed').over(w), 'Total_Time_Elapsed')) df.show() # -------- ------------------ -------- # |PersonID|Total_Time_Elapsed|Lap_Time| # -------- ------------------ -------- # | 1| 200| 200| # | 1| 300| 100| # | 1| 550| 250| # | 2| 100| 100| # | 2| 150| 50| # | 2| 250| 100| # | 3| 150| 150| # | 3| 500| 350| # | 4| 100| 100| # | 4| 300| 200| # | 4| 350| 50| # | 4| 460| 110| # -------- ------------------ --------
SQL:
SELECT PersonID, Total_Time_Elapsed, COALESCE( Total_Time_Elapsed - LAG(Total_Time_Elapsed) OVER (PARTITION BY PersonID ORDER BY Total_Time_Elapsed), Total_Time_Elapsed) Lap_Time FROM df