Suppose I have the following dataframe being called datafr
:
ID Step Category Price
1 1 A 500
1 2 A 600
1 3 A 700
1 4 B 900
2 2 C 900
2 1 C 1000
2 4 C 1200
2 3 C 1400
3 2 C 1600
3 3 C 200
4 1 A 100
I now want to select only those IDs, which contain a 1 for column "Step". So the final dataframe should look like this:
ID Step Category Price
1 1 A 500
1 2 A 600
1 3 A 700
1 4 B 900
2 2 C 900
2 1 C 1000
2 4 C 1200
2 3 C 1400
4 1 A 100
So the ID 3 should not be included, as it has no record with step equal to 1.
My approach is to first select those entries which contain step 1 and save it to another dataframe. Then merge this back, so that I have it as an extra column on which I can filter:
datafr_step1=datafr.loc[datafr['Step'] == 1]
df_step1_renamed=datafr_step1.rename(columns={'Step': 'Step_1_values'})
print(df_step1_renamed)
df_merged=pd.merge(datafr,df_step1_renamed[['ID','Step_1_values']],on='ID',how='left')
print(df_merged)
final_df=df_merged.loc[df_merged['Step_1_values'] == 1]
print(final_df)
Output:
ID Step_1_values Category Price
0 1 1 A 500
5 2 1 C 1000
10 4 1 A 100
ID Step Category Price Step_1_values
0 1 1 A 500 1.0
1 1 2 A 600 1.0
2 1 3 A 700 1.0
3 1 4 B 900 1.0
4 2 2 C 900 1.0
5 2 1 C 1000 1.0
6 2 4 C 1200 1.0
7 2 3 C 1400 1.0
8 3 2 C 1600 NaN
9 3 3 C 200 NaN
10 4 1 A 100 1.0
ID Step Category Price Step_1_values
0 1 1 A 500 1.0
1 1 2 A 600 1.0
2 1 3 A 700 1.0
3 1 4 B 900 1.0
4 2 2 C 900 1.0
5 2 1 C 1000 1.0
6 2 4 C 1200 1.0
7 2 3 C 1400 1.0
10 4 1 A 100 1.0
It works, but I think this is not a good approach. I have a quite large dataframe, so I would be interested in an efficient solution.
CodePudding user response:
If performance is important avoid use groupby
, filter values by Series.isin
for ID
with 1
in boolean indexing
:
df = datafr[datafr['ID'].isin(datafr.loc[datafr['Step'] == 1, 'ID'].unique())]
print (df)
ID Step Category Price
0 1 1 A 500
1 1 2 A 600
2 1 3 A 700
3 1 4 B 900
4 2 2 C 900
5 2 1 C 1000
6 2 4 C 1200
7 2 3 C 1400
10 4 1 A 100
Details:
print (datafr.loc[datafr['Step'] == 1, 'ID'])
0 1
5 2
10 4
Name: ID, dtype: int64
print (datafr['ID'].isin(datafr.loc[datafr['Step'] == 1, 'ID']))
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 False
9 False
10 True
Name: ID, dtype: bool
CodePudding user response:
You can use GroupBy.transform
with any
to select a group if there is any 1 in it:
datafr[datafr['Step'].eq(1).groupby(df['ID']).transform('any')]
output:
ID Step Category Price
0 1 1 A 500
1 1 2 A 600
2 1 3 A 700
3 1 4 B 900
4 2 2 C 900
5 2 1 C 1000
6 2 4 C 1200
7 2 3 C 1400
10 4 1 A 100