I have the following dataframe:
MachineId  Timestamp            Events  EventValue  CycleEnd  UniqueId
7          2021-11-01 20:45:17  Attr-A  50          0         0
7          2021-11-01 20:45:44  Attr-B  1000        0         0
7          2021-11-01 21:00:00  Attr-C  5           0         0
7          2021-11-01 21:03:36  End     NULL        1         0
7          2021-11-01 21:11:43  Attr-B  1100        0         1
7          2021-11-01 21:11:44  Attr-C  2           0         1
7          2021-11-01 21:25:01  End     NULL        1         1
The resulting dataframe should look like this:
MachineId  CycleId  CycleStarttime       CycleEndtime         Attr-A  Attr-B  Attr-C  End
7          1        2021-11-01 20:45:17  2021-11-01 21:03:36  50      1000    5       NULL
7          2        2021-11-01 21:11:43  2021-11-01 21:25:01  NULL    1100    2       NULL
The CycleEnd column is related to the Events column: if it is 1, a cycle has completed. I would like to sequentially pivot the dataframe for each cycle end, so that I get the values of Events as columns with their corresponding values from the EventValue column.
My approach was to create a UniqueId column from the CycleEnd column using df['UniqueId'] = df['CycleEnd'].eq(1).shift().bfill().cumsum(), which identifies each cycle individually. I then loop over each unique id, filter the dataframe, store the required information in a temporary new dataframe, append it to a list, and finally concatenate everything.
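For context, here is a minimal sketch of that loop-based approach, using the sample data from the question (the DataFrame construction and the use of NaN for NULL are assumptions for illustration):

import numpy as np
import pandas as pd

# Sample data from the question (NULL represented as NaN)
df = pd.DataFrame({
    'MachineId': [7] * 7,
    'Timestamp': pd.to_datetime([
        '2021-11-01 20:45:17', '2021-11-01 20:45:44', '2021-11-01 21:00:00',
        '2021-11-01 21:03:36', '2021-11-01 21:11:43', '2021-11-01 21:11:44',
        '2021-11-01 21:25:01']),
    'Events': ['Attr-A', 'Attr-B', 'Attr-C', 'End', 'Attr-B', 'Attr-C', 'End'],
    'EventValue': [50, 1000, 5, np.nan, 1100, 2, np.nan],
    'CycleEnd': [0, 0, 0, 1, 0, 0, 1],
})

# Label each cycle: the row after every CycleEnd==1 starts a new group
df['UniqueId'] = df['CycleEnd'].eq(1).shift().bfill().cumsum()

# Loop over cycles, pivot each slice, collect, then concatenate once
parts = []
for uid, grp in df.groupby('UniqueId'):
    row = grp.pivot_table(index='MachineId', columns='Events',
                          values='EventValue', aggfunc='sum')
    row.insert(0, 'CycleId', uid + 1)
    row.insert(1, 'CycleStarttime', grp['Timestamp'].iloc[0])
    row.insert(2, 'CycleEndtime', grp['Timestamp'].iloc[-1])
    parts.append(row)
result = pd.concat(parts).reset_index()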
I would like to know whether there is a more efficient (performance-wise) approach to this problem. The dataframe has hundreds of thousands of rows. Thank you!
CodePudding user response:
I think it is possible here to aggregate the min and max datetimes per group, then pivot with DataFrame.pivot_table using an aggregate function such as mean or sum, and finally join the two together:
df['UniqueId'] = df['CycleEnd'].eq(1).shift().bfill().cumsum().add(1)

# min/max timestamp per cycle
df1 = (df.groupby(['MachineId','UniqueId'])
         .agg(CycleStarttime=('Timestamp','min'), CycleEndtime=('Timestamp','max')))

# spread Events into columns, filled with the matching EventValue
df2 = df.pivot_table(index=['MachineId','UniqueId'],
                     columns='Events',
                     values='EventValue',
                     aggfunc='sum')

df = df1.join(df2).rename_axis(['MachineId','CycleId']).reset_index()
print(df)
   MachineId  CycleId       CycleStarttime         CycleEndtime  Attr-A  Attr-B  \
0          7        1  2021-11-01 20:45:17  2021-11-01 21:03:36    50.0  1000.0
1          7        2  2021-11-01 21:11:43  2021-11-01 21:25:01     NaN  1100.0

   Attr-C  End
0     5.0  NaN
1     2.0  NaN
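One small note on this answer (an addition, not part of the original): the End rows carry NULL in EventValue, so the pivoted End column comes out all NaN. If that column is not wanted, one option is to exclude the End rows before pivoting:

# drop 'End' rows so no all-NaN End column appears in the pivot
df2 = (df[df['Events'].ne('End')]
         .pivot_table(index=['MachineId','UniqueId'],
                      columns='Events',
                      values='EventValue',
                      aggfunc='sum'))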
CodePudding user response:
Fortunately, your DataFrame has a UniqueId column, which facilitates grouping.
To accomplish the task, define a function to process each group of source rows:
def grpProc(grp):
    # ids plus first/last timestamp of the cycle
    rv1 = pd.Series([grp.MachineId.iloc[0], grp.UniqueId.iloc[0] + 1,
                     grp.Timestamp.iloc[0], grp.Timestamp.iloc[-1]],
                    index=['MachineId', 'CycleId', 'CycleStarttime', 'CycleEndtime'])
    # pivot the group without its trailing 'End' row
    rv2 = grp[:-1].pivot(index='MachineId', columns='Events',
                         values='EventValue').iloc[0]
    return pd.concat([rv1, rv2])
Then run:
wrk = df.groupby('UniqueId').apply(grpProc)
result = wrk.unstack().reindex(columns=wrk[0].index)
result.index.name = None
For your source data the result is:
   MachineId  CycleId       CycleStarttime         CycleEndtime  Attr-A  Attr-B  Attr-C
0          7        1  2021-11-01 20:45:17  2021-11-01 21:03:36      50    1000       5
1          7        2  2021-11-01 21:11:43  2021-11-01 21:25:01     NaN    1100       2
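Performance-wise, keep in mind that groupby.apply invokes a Python function once per group, so on hundreds of thousands of rows it behaves much like the original loop, while the pivot_table approach in the first answer is vectorized and typically faster. A rough way to compare both on your own data (a sketch only; actual timings depend on your data):

from timeit import timeit

t_pivot = timeit(lambda: df.pivot_table(index=['MachineId','UniqueId'],
                                        columns='Events',
                                        values='EventValue',
                                        aggfunc='sum'), number=10)
t_apply = timeit(lambda: df.groupby('UniqueId').apply(grpProc), number=10)
print(f'pivot_table: {t_pivot:.3f}s, groupby.apply: {t_apply:.3f}s')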