I have 9.5M rows in a DataFrame of the following form:
Id    | X  | Y  | Pass_Fail_Status
------|----|----|-----------------
w0    | 0  | 0  | Pass
w0    | 0  | 1  | Fail
...
w1    | 0  | 0  | Fail
...
w6000 | 45 | 45 | Pass
What is the most efficient way to select a subset DataFrame for each "Id" and do processing with it?
As of now I'm doing the following:
- I already have the set of possible "Id"s from another DataFrame
for id in uniqueIds:
    subsetDF = mainDF[mainDF["Id"] == id]
    predLabel = predict(subsetDF)
But this seems to have a severe performance issue, as there are 6.7K such possible ids and each repeats 1.4K times. I've done some profiling with cProfile; it does not point to this line directly, but I see a scalar-op call taking time which has exactly 6.7K calls.
EDIT2: The requirement for the subset DataFrame is that all rows should have the same Id. For training or prediction the 'Id' itself is not that important; what matters is the X, Y location and the pass/fail status at that location.
The subsetDF should be of the following form:
Id    | X  | Y  | Pass_Fail_Status
------|----|----|-----------------
w0    | 0  | 0  | Pass
w0    | 0  | 1  | Fail
...
w1399 | 0  | 0  | Fail
...
w1399 | 45 | 45 | Pass
CodePudding user response:
Conclusion:
According to the results of my experiments, the most efficient way to select a subset DataFrame for each "Id" and process it is to get the start and end index of each "Id" group in advance and to use the .iloc method to fetch the target rows.
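As a minimal sketch (predict and uniqueIds are the OP's and are not defined here), the winning approach from experiment 4 below could be wired into the prediction loop like this:
# Minimal sketch; `predict` and `uniqueIds` come from the OP's code.
df = mainDF.sort_values('Id', ignore_index=True)
pos_head = df.groupby('Id')['Id'].head(1).reset_index().set_index('Id')
pos_tail = df.groupby('Id')['Id'].tail(1).reset_index().set_index('Id')
pos = pos_head.join(pos_tail, lsuffix='_head', rsuffix='_tail')
for _id in uniqueIds:
    # +1 because .iloc slicing excludes the end position, unlike .loc
    subsetDF = df.iloc[pos.loc[_id, 'index_head']:pos.loc[_id, 'index_tail'] + 1]
    predLabel = predict(subsetDF)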
Code (Jupyter Lab):
# Preparation:
import pandas as pd
import numpy as np
# Create a sample dataframe
n = 6700 * 1400 # = 9380000
freq = 1400
mainDF = pd.DataFrame({
    'Id': ['w{:04d}'.format(i//freq) for i in range(n)],
    'X': np.random.randint(0, 46, n),
    'Y': np.random.randint(0, 46, n),
    'Pass_Fail_Status': [('Pass', 'Fail')[i] for i in np.random.randint(0, 2, n)]
})
uniqueIds = set(mainDF['Id'])
# Experiments:
# exp 1: apply pandas mask (the OP's method)
df1 = mainDF
def exp_1():
    for _id in uniqueIds:
        subsetDF = df1[df1['Id'] == _id]
print('Experiment 1:')
%timeit exp_1()
# exp 2: use set_index
df2 = mainDF.set_index('Id')
def exp_2():
    for _id in uniqueIds:
        subsetDF = df2.loc[_id]
print('Experiment 2:')
%timeit exp_2()
# exp 3: get start&end index and use loc
df3 = mainDF.sort_values('Id', ignore_index=True)
df3_pos_head = df3.groupby('Id')['Id'].head(1).reset_index().set_index('Id')
df3_pos_tail = df3.groupby('Id')['Id'].tail(1).reset_index().set_index('Id')
df3_pos = df3_pos_head.join(df3_pos_tail, lsuffix='_head', rsuffix='_tail')
def exp_3():
    for _id in uniqueIds:
        subsetDF = df3.loc[df3_pos.loc[_id, 'index_head']:df3_pos.loc[_id, 'index_tail']]
print('Experiment 3:')
%timeit exp_3()
# exp 4: get start&end index and use iloc
df4 = df3
df4_pos = df3_pos
def exp_4():
    for _id in uniqueIds:
        # +1 because .iloc slicing excludes the end position, unlike .loc in exp 3
        subsetDF = df4.iloc[df4_pos.loc[_id, 'index_head']:df4_pos.loc[_id, 'index_tail'] + 1]
print('Experiment 4:')
%timeit exp_4()
# exp 5: use groupby
df5 = mainDF
df5_groupby = df5.groupby('Id')
def exp_5():
    for _, subsetDF in df5_groupby:
        pass
print('Experiment 5:')
%timeit exp_5()
# exp 6: use set_index and groupby
df6 = mainDF.set_index('Id')
df6_groupby = df6.groupby('Id')
def exp_6():
    for _, subsetDF in df6_groupby:
        pass
print('Experiment 6:')
%timeit exp_6()
Output:
Experiment 1: # apply pandas mask (the OP's method)
# More than 10 minutes. I will update this record later.
Experiment 2: # use set_index
235 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Experiment 3: # get start&end index and use loc
216 ms ± 6.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Experiment 4: # get start&end index and use iloc
171 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Experiment 5: # use groupby
548 ms ± 8.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Experiment 6: # use set_index and groupby
497 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sample dataframe:
        | Id    | X  | Y  | Pass_Fail_Status
--------|-------|----|----|-----------------
0       | w0000 | 9  | 28 | Fail
1       | w0000 | 42 | 28 | Pass
2       | w0000 | 26 | 36 | Pass
...
9379997 | w6699 | 12 | 14 | Fail
9379998 | w6699 | 8  | 40 | Fail
9379999 | w6699 | 17 | 21 | Pass
df3_pos & df4_pos:
Id    | index_head | index_tail
------|------------|-----------
w0000 | 0          | 1399
w0001 | 1400       | 2799
w0002 | 2800       | 4199
...
w6697 | 9375800    | 9377199
w6698 | 9377200    | 9378599
w6699 | 9378600    | 9379999
Note:
Personally, I would use set_index (experiment 2) because it is very simple and there is little difference among the top 3 fastest methods.
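In practice (a sketch; predict and uniqueIds come from the OP's code), the set_index approach would look like:
# Sketch of the set_index approach plugged into the OP's loop.
indexedDF = mainDF.set_index('Id')
for _id in uniqueIds:
    # Note: after set_index, 'Id' is the index rather than a column;
    # per the OP's EDIT2, 'Id' itself is not needed by predict().
    subsetDF = indexedDF.loc[_id]
    predLabel = predict(subsetDF)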
CodePudding user response:
IIUC, you could use groupby + sample to randomly sample a certain fraction of the original df to split into train and test DataFrames:
train = df.groupby('Id').sample(frac=0.7)
test = df[~df.index.isin(train.index)]
For example, in the sample you have in the OP, the above code produces:
train:
Id X Y Pass_Fail_Status
0 w0 0 0 Pass
2 w1 0 0 Fail
3 w6000 45 45 Pass
test:
Id X Y Pass_Fail_Status
1 w0 0 1 Fail
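One possible variation (not in the original answer): groupby(...).sample also accepts a random_state, which makes the per-Id split reproducible across runs:
# Same split as above, but reproducible; random_state=42 is an arbitrary choice
train = df.groupby('Id').sample(frac=0.7, random_state=42)
test = df[~df.index.isin(train.index)]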
CodePudding user response:
I have tried the 'groupby' suggestion by 'enke' and it improved the performance by 6x (I measured 3 times each and this is based on the average). Now the for loop looks like the following:
for id, subsetDF in mainDF.groupby("Id", as_index=False):
    predLabel = predict(subsetDF)
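If the per-Id predictions need to be kept rather than just computed, one option is to collect them into a Series keyed by Id (a sketch; it assumes predict returns a single scalar label per group, which the post does not spell out):
import pandas as pd

# Sketch: collect one predicted label per Id (assumes predict() returns a scalar)
predictions = {}
for _id, subsetDF in mainDF.groupby("Id", as_index=False):
    predictions[_id] = predict(subsetDF)
predLabels = pd.Series(predictions, name="Pred_Label")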