Home > database >  Create Column with Lap Times Given Total Time Elapsed
Create Column with Lap Times Given Total Time Elapsed

Time:06-27

I have four different people, each run laps and I need to calculate lap time for each lap for each person. I am given the total elapsed time starting from the very beginning at the end of each lap. What kind of PySpark/SQL/Pandas syntax could I use to calculate lap times efficiently?

Example:

Each row represents one person running one lap.

PersonID Total Time Elapsed (sec) Lap Time (sec)
1 200 200
1 300 100
1 550 250
2 100 100
2 150 50
2 250 100
3 150 150
3 500 350
4 100 100
4 300 200
4 350 50
4 460 110

I need to create the 3rd column, Lap Time, using only the first two columns. I could write a for loop that would eventually work but what is the best/efficient PySpark/SQL/Pandas way to get there?

CodePudding user response:

If you are looking for a pandas solution, one of the approaches could be:

import numpy as np
import pandas as pd
data=[[1,200],[1,300],[1,550],[2,100],[2,150],[2,250],[3,150],[3,500],[4,100],[4,300],[4,350],[4,460]]
df = pd.DataFrame(
data,columns=['PersonID','Total Time Elapsed (sec)']
)
print(df.head(50))
# Peform groupby on PersonID
grouped_df=df.groupby(by='PersonID')
series_list=[]

for name,grp in grouped_df:
  result=grp['Total Time Elapsed (sec)'].diff() # Compute difference between current and prev row
  result=result.fillna(grp['Total Time Elapsed (sec)']) # Fill na with same value in Elapsed col(same row).
  series_list.extend(result.values)

df['Lap Time (sec)']=series_list
df['Lap Time (sec)']=df['Lap Time (sec)'].astype(int) # Changing the datatype
print(df.head(50))

CodePudding user response:

Window functions could do it.

  • PySpark:

    from pyspark.sql import functions as F, Window as W
    df = spark.createDataFrame(
        [(1, 200), (1, 300), (1, 550), (2, 100), (2, 150), (2, 250),
         (3, 150), (3, 500), (4, 100), (4, 300), (4, 350), (4, 460)],
        ['PersonID', 'Total_Time_Elapsed'])
    
    w = W.partitionBy('PersonID').orderBy('Total_Time_Elapsed')
    df = df.withColumn(
        'Lap_Time',
        F.coalesce(
            F.col('Total_Time_Elapsed') - F.lag('Total_Time_Elapsed').over(w),
            'Total_Time_Elapsed'))
    df.show()
    #  -------- ------------------ -------- 
    # |PersonID|Total_Time_Elapsed|Lap_Time|
    #  -------- ------------------ -------- 
    # |       1|               200|     200|
    # |       1|               300|     100|
    # |       1|               550|     250|
    # |       2|               100|     100|
    # |       2|               150|      50|
    # |       2|               250|     100|
    # |       3|               150|     150|
    # |       3|               500|     350|
    # |       4|               100|     100|
    # |       4|               300|     200|
    # |       4|               350|      50|
    # |       4|               460|     110|
    #  -------- ------------------ -------- 
    
  • SQL:

    SELECT
        PersonID,
        Total_Time_Elapsed,
        COALESCE(
            Total_Time_Elapsed - LAG(Total_Time_Elapsed) OVER (PARTITION BY PersonID ORDER BY Total_Time_Elapsed),
            Total_Time_Elapsed) Lap_Time
    FROM df
    
  • Related