I have the following DataFrame:
model_year cylinders mpg
0 70 4 25.285714
1 70 6 20.500000
2 70 8 14.111111
3 71 4 27.461538
4 71 6 18.000000
5 71 8 13.428571
6 72 3 19.000000
7 72 4 23.428571
8 72 8 13.615385
9 73 3 18.000000
10 73 4 22.727273
11 73 6 19.000000
12 73 8 13.200000
13 74 4 27.800000
14 74 6 17.857143
15 74 8 14.200000
16 75 4 25.250000
17 75 6 17.583333
18 75 8 15.666667
19 76 4 26.766667
20 76 6 20.000000
21 76 8 14.666667
22 77 3 21.500000
23 77 4 29.107143
24 77 6 19.500000
25 77 8 16.000000
26 78 4 29.576471
27 78 5 20.300000
28 78 6 19.066667
29 78 8 19.050000
30 79 4 31.525000
31 79 5 25.400000
32 79 6 22.950000
33 79 8 18.630000
34 80 3 23.700000
35 80 4 34.612000
36 80 5 36.400000
37 80 6 25.900000
38 81 4 32.814286
39 81 6 23.428571
40 81 8 26.600000
41 82 4 32.071429
42 82 6 28.333333
I want to select rows that fulfill the following condition: for each model_year, select the row with the minimal value of cylinders in that year.
So, for instance, for model years 70, 71, 72 and 73 I want to get:
model_year cylinders mpg
0 70 4 25.285714
3 71 4 27.461538
6 72 3 19.000000
9 73 3 18.000000
My most advanced attempt consisted of this:
- I converted the model_year and cylinders columns into a MultiIndex of the DataFrame.
- Using (among others) the groupby method, I obtained a MultiIndex object of the rows I'd like to select.
However, I couldn't find a way to select rows using a MultiIndex object.
For reference, the MultiIndex I obtained is:
MultiIndex([(70, 4),
(71, 4),
(72, 3),
(73, 3),
(74, 4),
(75, 4),
(76, 4),
(77, 3),
(78, 4),
(79, 4),
(80, 3),
(81, 4),
(82, 4)],
names=['model_year', 'cylinders'])
CodePudding user response:
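You could use groupby + idxmin to get the index label of the first row with the minimal cylinders value in each model_year, and then select those rows from df with loc: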
out = df.loc[df.groupby('model_year')['cylinders'].idxmin()]
Output:
model_year cylinders mpg
0 70 4 25.285714
3 71 4 27.461538
6 72 3 19.000000
9 73 3 18.000000
13 74 4 27.800000
16 75 4 25.250000
19 76 4 26.766667
22 77 3 21.500000
26 78 4 29.576471
30 79 4 31.525000
34 80 3 23.700000
38 81 4 32.814286
41 82 4 32.071429
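If you already have the MultiIndex from the question, one possible way to select with it directly is a membership test. This is only a sketch on a small subset of the data (years 70 to 73); the names wanted, row_keys and selected are made up for illustration:

```python
import pandas as pd

# Small subset of the question's data (model years 70-73 only).
df = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71, 71, 72, 72, 72, 73, 73, 73, 73],
    "cylinders":  [4, 6, 8, 4, 6, 8, 3, 4, 8, 3, 4, 6, 8],
    "mpg": [25.285714, 20.5, 14.111111, 27.461538, 18.0, 13.428571,
            19.0, 23.428571, 13.615385, 18.0, 22.727273, 19.0, 13.2],
})

# The (model_year, cylinders) pairs to keep, in the same form as the
# MultiIndex shown in the question (truncated to four years here).
wanted = pd.MultiIndex.from_tuples(
    [(70, 4), (71, 4), (72, 3), (73, 3)],
    names=["model_year", "cylinders"],
)

# Build a MultiIndex from the two columns without changing df's own index,
# then keep the rows whose (model_year, cylinders) pair is in `wanted`.
row_keys = pd.MultiIndex.from_frame(df[["model_year", "cylinders"]])
selected = df[row_keys.isin(wanted)]

print(selected)
```

Unlike idxmin, this keeps every row whose pair appears in the MultiIndex, so duplicated pairs would all be selected.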
CodePudding user response:
I think a simpler solution would actually be to use groupby + transform:
selected = df[df['cylinders'] == df.groupby('model_year')['cylinders'].transform('min')]
Output:
>>> selected
model_year cylinders mpg
0 70 4 25.285714
3 71 4 27.461538
6 72 3 19.000000
9 73 3 18.000000
13 74 4 27.800000
16 75 4 25.250000
19 76 4 26.766667
22 77 3 21.500000
26 78 4 29.576471
30 79 4 31.525000
34 80 3 23.700000
38 81 4 32.814286
41 82 4 32.071429
(Note that if there are multiple minimums for a group (e.g. if model_year 70 had two 4-cylinder rows), all of them will be included in the output.)
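To see the difference on ties, here is a small sketch with made-up data in which model_year 70 has two rows tied at 4 cylinders (the values are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical data: model_year 70 has two rows tied at the minimum (4 cylinders).
df = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71],
    "cylinders":  [4, 4, 8, 4, 6],
    "mpg":        [25.0, 26.0, 14.0, 27.0, 18.0],
})

# transform('min') broadcasts each group's minimum back onto every row,
# so the comparison keeps all rows tied at the minimum.
per_year_min = df.groupby("model_year")["cylinders"].transform("min")
all_ties = df[df["cylinders"] == per_year_min]

# idxmin returns a single index label per group (the first occurrence),
# so only one row per model_year is kept.
first_only = df.loc[df.groupby("model_year")["cylinders"].idxmin()]

print(all_ties)
print(first_only)
```

So which approach fits depends on whether you want exactly one row per year or every row that reaches the minimum.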
CodePudding user response:
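You can also sort by cylinders in ascending order, so that drop_duplicates keeps the minimal row for each model_year (sort_index then restores the original row order):
out = df.sort_values('cylinders').drop_duplicates('model_year').sort_index()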
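A quick check of the sort-then-deduplicate idea on a subset of the data (years 72 and 73). The sort has to be ascending for the smallest cylinders value to come first, since drop_duplicates keeps the first row per model_year by default; a descending sort would keep the maximum instead. The kind="stable" argument is my own addition so that tied rows keep their original relative order:

```python
import pandas as pd

# Subset of the question's data (model years 72 and 73).
df = pd.DataFrame({
    "model_year": [72, 72, 72, 73, 73, 73, 73],
    "cylinders":  [3, 4, 8, 3, 4, 6, 8],
    "mpg": [19.0, 23.428571, 13.615385, 18.0, 22.727273, 19.0, 13.2],
})

out = (
    df.sort_values("cylinders", kind="stable")  # smallest cylinders first
      .drop_duplicates("model_year")            # keep first row per model_year
      .sort_index()                             # back to the original row order
)

print(out)
```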