Pandas dataframe: Efficiently select cases

Suppose I have the following dataframe, called datafr:

ID  Step Category  Price
1     1        A    500
1     2        A    600
1     3        A    700
1     4        B    900
2     2        C    900
2     1        C   1000
2     4        C   1200
2     3        C   1400
3     2        C   1600
3     3        C    200
4     1        A    100

I now want to select only those IDs that have a 1 in the "Step" column. The final dataframe should then look like this:

ID  Step Category  Price
1     1        A    500
1     2        A    600
1     3        A    700
1     4        B    900
2     2        C    900
2     1        C   1000
2     4        C   1200
2     3        C   1400
4     1        A    100

So ID 3 should not be included, as it has no record with Step equal to 1.

My approach is to first select the entries with Step 1 and save them to a separate dataframe, then merge that back in so I have an extra column to filter on:

import pandas as pd

# Select the rows where Step is 1
datafr_step1 = datafr.loc[datafr['Step'] == 1]
# Rename the column so it survives the merge as a marker column
df_step1_renamed = datafr_step1.rename(columns={'Step': 'Step_1_values'})
print(df_step1_renamed)
# Left-join the marker column back onto the full dataframe
df_merged = pd.merge(datafr, df_step1_renamed[['ID', 'Step_1_values']], on='ID', how='left')
print(df_merged)
# Keep only the rows whose ID has a Step 1 record
final_df = df_merged.loc[df_merged['Step_1_values'] == 1]
print(final_df)

Output:

   ID  Step_1_values Category  Price
0    1              1        A    500
5    2              1        C   1000
10   4              1        A    100
    ID  Step Category  Price  Step_1_values
0    1     1        A    500            1.0
1    1     2        A    600            1.0
2    1     3        A    700            1.0
3    1     4        B    900            1.0
4    2     2        C    900            1.0
5    2     1        C   1000            1.0
6    2     4        C   1200            1.0
7    2     3        C   1400            1.0
8    3     2        C   1600            NaN
9    3     3        C    200            NaN
10   4     1        A    100            1.0
    ID  Step Category  Price  Step_1_values
0    1     1        A    500            1.0
1    1     2        A    600            1.0
2    1     3        A    700            1.0
3    1     4        B    900            1.0
4    2     2        C    900            1.0
5    2     1        C   1000            1.0
6    2     4        C   1200            1.0
7    2     3        C   1400            1.0
10   4     1        A    100            1.0

It works, but I don't think this is a good approach. Since my dataframe is quite large, I would be interested in an efficient solution.

CodePudding user response:

If performance is important, avoid groupby. Instead filter with Series.isin, using boolean indexing to keep the IDs that contain a 1:

df = datafr[datafr['ID'].isin(datafr.loc[datafr['Step'] == 1, 'ID'].unique())]
print(df)
    ID  Step Category  Price
0    1     1        A    500
1    1     2        A    600
2    1     3        A    700
3    1     4        B    900
4    2     2        C    900
5    2     1        C   1000
6    2     4        C   1200
7    2     3        C   1400
10   4     1        A    100

Details:

print(datafr.loc[datafr['Step'] == 1, 'ID'])
0     1
5     2
10    4
Name: ID, dtype: int64

print(datafr['ID'].isin(datafr.loc[datafr['Step'] == 1, 'ID']))
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8     False
9     False
10     True
Name: ID, dtype: bool
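
As a side note, Series.isin hashes the values it is given, so duplicates in the lookup values are harmless and the unique() call is optional. A minimal equivalent sketch (the ids_with_step1 name is just illustrative):

# isin builds a hash set internally, so the raw (possibly duplicated)
# ID Series can be passed directly without calling unique().
ids_with_step1 = datafr.loc[datafr['Step'] == 1, 'ID']
df = datafr[datafr['ID'].isin(ids_with_step1)]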

CodePudding user response:

You can use GroupBy.transform with 'any' to select the groups that contain at least one 1:

datafr[datafr['Step'].eq(1).groupby(datafr['ID']).transform('any')]

output:

    ID  Step Category  Price
0    1     1        A    500
1    1     2        A    600
2    1     3        A    700
3    1     4        B    900
4    2     2        C    900
5    2     1        C   1000
6    2     4        C   1200
7    2     3        C   1400
10   4     1        A    100
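
For completeness, here is a self-contained sketch that rebuilds the sample data from the question and checks that both answers produce the same result:

import pandas as pd

# Rebuild the sample data from the question.
datafr = pd.DataFrame({
    'ID':       [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    'Step':     [1, 2, 3, 4, 2, 1, 4, 3, 2, 3, 1],
    'Category': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'A'],
    'Price':    [500, 600, 700, 900, 900, 1000, 1200, 1400, 1600, 200, 100],
})

# First answer: boolean indexing with Series.isin.
out_isin = datafr[datafr['ID'].isin(datafr.loc[datafr['Step'] == 1, 'ID'])]

# Second answer: GroupBy.transform('any') on the Step == 1 mask.
out_any = datafr[datafr['Step'].eq(1).groupby(datafr['ID']).transform('any')]

# Both keep IDs 1, 2 and 4 and drop ID 3.
assert out_isin.equals(out_any)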