here are some lines from my DataFrame composed as follows :
Index ['TimeStamp'] ['ThreadID'] ['Start|End'] ['StepIdentifier']
1017 18:44:22,997 [12] Start FetchMasterData
1018 18:44:22,997 [12] Start GetSmrCalculatedMeterData
1020 18:44:22,997 [12] End GetSmrCalculatedMeterData
1021 18:44:22,997 [12] Start GetSmrPhysicalMeterData
1023 18:44:23,013 [12] End GetSmrPhysicalMeterData
1024 18:44:23,013 [12] Start GetSmrMarketDeliveryPointData
1026 18:44:23,013 [12] End GetSmrMarketDeliveryPointData
1027 18:44:23,013 [12] Start GetSmrMarketHeadpointData
1029 18:44:23,013 [12] End GetSmrMarketHeadpointData
1030 18:44:23,013 [12] End FetchMasterData
I need for each Thread to get the Start and End time for each process. We can see that the "Start" and the "End" for a process (like FetchMasterData) do not necessarily follow each other.
In order to create a new DataFrame like this:
['ThreadID'] ['StepIdentifier'] ['Start'] ['End']
[12] FetchMasterData 18:44:22,997 18:44:23,013
I tried to compare a line that has a process start with all the following ones until I find the end of the process, then to create a line in the new dataframe :
def iterateDfStartEnd(df):
df = createDataFrame() // Function to create the DataFrame shown above
new_df = pd.DataFrame(columns=['ThreadID','StepIdentifier','StartTime','EndTime'])
j = 1013 // Because i starts at 1012(like j = i 1) and i ends at 1140
for i in df.index
while(df.iloc[i,'End|Start'] == "Start" & df.iloc[j,'End|Start'] == "End"
& df.iloc[i,'StepIdentifier'] == df.iloc[j,'StepIdentifier']
& df.iloc[i,'ThreadID'] == df.iloc[j,'ThreadID'] & j < 1141):
j = 1
new_df['ThreadID'] = df.iloc[i,'ThreadID']
new_df['StepIdentifier'] = df.iloc[i,'StepIdentifier']
new_df['StartTime'] = df.iloc[i,'TimeStamp']
new_df['EndTime'] = df.iloc[j,'TimeStamp']
j = i 1
return new_df
For finally calculate for each process the time between the operations Start and End and to have a DataFrame like this (just for your information) :
['ThreadID'] ['StepIdentifier'] ['Time']
[12] FetchMasterData 16s
I don't understand how to iterate the dataframe by comparing rows as I do, I have errors with the indexes. Does anyone have another way to create the new dataframe ? Thanks everyone !
CodePudding user response:
There's a fairly straightforward solution if your dataframe always has exactly one Start and one End time. You do a .groupby()
on the ThreadId and the StepIdentifier, and then use the minimum of TimeStamp for Start and the maximum for End:
df = pd.DataFrame({'ts': ['18','19','20','21','22','23'],
'id': ['12','12','12','12','12','12'],
'se': ['Start', 'Start', 'End', 'Start', 'End', 'End'],
'event': ['stomp','cramp','stomp','break','cramp','break']})
g = df.groupby(["id",'event']).agg(Start = ('ts', lambda x: min(x)), End = ('ts', lambda x: max(x)))
g.reset_index()
The output looks like:
id event Start End
0 12 break 21 23
1 12 cramp 19 22
2 12 stomp 18 20
CodePudding user response:
You could approach this with a pivot_table
df.pivot_table(
index=['ThreadID', 'StepIdentifier'],
columns='Start|End',
values='TimeStamp',
aggfunc='first',
fill_value=np.nan
).rename_axis(None, axis=1).reset_index()