I have 9.5M rows in a DataFrame of the following form:
Id    | X  | Y  | Pass_Fail_Status
------|----|----|-----------------
w0    | 0  | 0  | Pass
w0    | 0  | 1  | Fail
...
w1    | 0  | 0  | Fail
...
w6000 | 45 | 45 | Pass
What is the most efficient way to select a subset DataFrame for each "Id" and do processing with it?
As of now I'm doing the following:
- I already have the set of possible "Id"s from another DataFrame
for id in uniqueIds:
    subsetDF = mainDF[mainDF["Id"] == id]
    predLabel = predict(subsetDF)
But this seems to have a severe performance issue, as there are 6.7K such possible ids and each repeats 1.4K times. I've done some profiling with cProfile; it does not point to this line directly, but I see a scalar-op call taking time which has exactly 6.7K calls.
EDIT2: The requirement for the subset DataFrame is that all rows should have the same Id. For training or prediction the 'Id' itself is not that important; what matters is the X, Y location and the pass/fail status at that location.
The subsetDF should be of the following form:
Id    | X  | Y  | Pass_Fail_Status
------|----|----|-----------------
w0    | 0  | 0  | Pass
w0    | 0  | 1  | Fail
...
w1399 | 0  | 0  | Fail
...
w1399 | 45 | 45 | Pass
CodePudding user response:
Conclusion:
According to the results of my experiments, the most efficient way to select a subset DataFrame for each "Id" and process it is to get the start and end index of each "Id" group in advance and to use the .iloc method to fetch the target rows.
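As a minimal sketch (predict and uniqueIds are the OP's and are not defined here), the winning approach from experiment 4 below could be wired into the prediction loop like this:
# Minimal sketch; `predict` and `uniqueIds` come from the OP's code.
df = mainDF.sort_values('Id', ignore_index=True)
pos_head = df.groupby('Id')['Id'].head(1).reset_index().set_index('Id')
pos_tail = df.groupby('Id')['Id'].tail(1).reset_index().set_index('Id')
pos = pos_head.join(pos_tail, lsuffix='_head', rsuffix='_tail')
for _id in uniqueIds:
    # +1 because .iloc slicing excludes the end position, unlike .loc
    subsetDF = df.iloc[pos.loc[_id, 'index_head']:pos.loc[_id, 'index_tail'] + 1]
    predLabel = predict(subsetDF)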
Code (Jupyter Lab):
# Preparation:
import pandas as pd
import numpy as np
# Create a sample dataframe
n = 6700 * 1400 # = 9380000
freq = 1400
mainDF = pd.DataFrame({
    'Id': ['w{:04d}'.format(i//freq) for i in range(n)],
    'X': np.random.randint(0, 46, n),
    'Y': np.random.randint(0, 46, n),
    'Pass_Fail_Status': [('Pass', 'Fail')[i] for i in np.random.randint(0, 2, n)]
})
uniqueIds = set(mainDF['Id'])
# Experiments:
# exp 1: apply pandas mask (the OP's method)
df1 = mainDF
def exp_1():
    for _id in uniqueIds:
        subsetDF = df1[df1['Id'] == _id]
print('Experiment 1:')
%timeit exp_1()
# exp 2: use set_index
df2 = mainDF.set_index('Id')
def exp_2():
    for _id in uniqueIds:
        subsetDF = df2.loc[_id]
print('Experiment 2:')
%timeit exp_2()
# exp 3: get start&end index and use loc
df3 = mainDF.sort_values('Id', ignore_index=True)
df3_pos_head = df3.groupby('Id')['Id'].head(1).reset_index().set_index('Id')
df3_pos_tail = df3.groupby('Id')['Id'].tail(1).reset_index().set_index('Id')
df3_pos = df3_pos_head.join(df3_pos_tail, lsuffix='_head', rsuffix='_tail')
def exp_3():
    for _id in uniqueIds:
        subsetDF = df3.loc[df3_pos.loc[_id, 'index_head']:df3_pos.loc[_id, 'index_tail']]
print('Experiment 3:')
%timeit exp_3()
# exp 4: get start&end index and use iloc
df4 = df3
df4_pos = df3_pos
def exp_4():
    for _id in uniqueIds:
        # +1 because .iloc slicing excludes the end position, unlike .loc in exp 3
        subsetDF = df4.iloc[df4_pos.loc[_id, 'index_head']:df4_pos.loc[_id, 'index_tail'] + 1]
print('Experiment 4:')
%timeit exp_4()
# exp 5: use groupby
df5 = mainDF
df5_groupby = df5.groupby('Id')
def exp_5():
    for _, subsetDF in df5_groupby:
        pass
print('Experiment 5:')
%timeit exp_5()
# exp 6: use set_index and groupby
df6 = mainDF.set_index('Id')
df6_groupby = df6.groupby('Id')
def exp_6():
    for _, subsetDF in df6_groupby:
        pass
print('Experiment 6:')
%timeit exp_6()
Output:
Experiment 1: # apply pandas mask (the OP's method)
# More than 10 minutes. I will update this record later.
Experiment 2: # use set_index
235 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Experiment 3: # get start&end index and use loc
216 ms ± 6.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Experiment 4: # get start&end index and use iloc
171 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Experiment 5: # use groupby
548 ms ± 8.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Experiment 6: # use set_index and groupby
497 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sample dataframe:
        | Id    | X  | Y  | Pass_Fail_Status
--------|-------|----|----|-----------------
0       | w0000 | 9  | 28 | Fail
1       | w0000 | 42 | 28 | Pass
2       | w0000 | 26 | 36 | Pass
...
9379997 | w6699 | 12 | 14 | Fail
9379998 | w6699 | 8  | 40 | Fail
9379999 | w6699 | 17 | 21 | Pass
df3_pos & df4_pos:
Id    | index_head | index_tail
------|------------|-----------
w0000 | 0          | 1399
w0001 | 1400       | 2799
w0002 | 2800       | 4199
...
w6697 | 9375800    | 9377199
w6698 | 9377200    | 9378599
w6699 | 9378600    | 9379999
Note:
Personally, I would use set_index (experiment 2) because it is very simple and there is little difference among the top 3 fastest methods.
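In practice (a sketch; predict and uniqueIds come from the OP's code), the set_index approach would look like:
# Sketch of the set_index approach plugged into the OP's loop.
indexedDF = mainDF.set_index('Id')
for _id in uniqueIds:
    # Note: after set_index, 'Id' is the index rather than a column;
    # per the OP's EDIT2, 'Id' itself is not needed by predict().
    subsetDF = indexedDF.loc[_id]
    predLabel = predict(subsetDF)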
CodePudding user response:
IIUC, you could use groupby + sample to randomly sample a certain fraction of the original df to split into train and test DataFrames:
train = df.groupby('Id').sample(frac=0.7)
test = df[~df.index.isin(train.index)]
For example, in the sample you have in the OP, the above code produces:
train:
Id X Y Pass_Fail_Status
0 w0 0 0 Pass
2 w1 0 0 Fail
3 w6000 45 45 Pass
test:
Id X Y Pass_Fail_Status
1 w0 0 1 Fail
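One possible variation (not in the original answer): groupby(...).sample also accepts a random_state, which makes the per-Id split reproducible across runs:
# Same split as above, but reproducible; random_state=42 is an arbitrary choice
train = df.groupby('Id').sample(frac=0.7, random_state=42)
test = df[~df.index.isin(train.index)]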
CodePudding user response:
I have tried the 'groupby' suggestion by 'enke' and it improved the performance by 6x (I measured 3 times each and this is based on the average). Now the for loop looks like the following:
for id, subsetDF in mainDF.groupby("Id", as_index=False):
    predLabel = predict(subsetDF)
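If the per-Id predictions need to be kept rather than just computed, one option is to collect them into a Series keyed by Id (a sketch; it assumes predict returns a single scalar label per group, which the post does not spell out):
import pandas as pd

# Sketch: collect one predicted label per Id (assumes predict() returns a scalar)
predictions = {}
for _id, subsetDF in mainDF.groupby("Id", as_index=False):
    predictions[_id] = predict(subsetDF)
predLabels = pd.Series(predictions, name="Pred_Label")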