Select rows by variable condition (i.e. desired value in each row depends on values in other rows)

I have the following DataFrame:

    model_year  cylinders        mpg
0           70          4  25.285714
1           70          6  20.500000
2           70          8  14.111111
3           71          4  27.461538
4           71          6  18.000000
5           71          8  13.428571
6           72          3  19.000000
7           72          4  23.428571
8           72          8  13.615385
9           73          3  18.000000
10          73          4  22.727273
11          73          6  19.000000
12          73          8  13.200000
13          74          4  27.800000
14          74          6  17.857143
15          74          8  14.200000
16          75          4  25.250000
17          75          6  17.583333
18          75          8  15.666667
19          76          4  26.766667
20          76          6  20.000000
21          76          8  14.666667
22          77          3  21.500000
23          77          4  29.107143
24          77          6  19.500000
25          77          8  16.000000
26          78          4  29.576471
27          78          5  20.300000
28          78          6  19.066667
29          78          8  19.050000
30          79          4  31.525000
31          79          5  25.400000
32          79          6  22.950000
33          79          8  18.630000
34          80          3  23.700000
35          80          4  34.612000
36          80          5  36.400000
37          80          6  25.900000
38          81          4  32.814286
39          81          6  23.428571
40          81          8  26.600000
41          82          4  32.071429
42          82          6  28.333333
42  82  6   28.333333

I want to select, for each model_year, the row with the smallest value of cylinders in that year.

For instance, for model years 70, 71, 72 and 73 I want to get:

    model_year  cylinders        mpg
0           70          4  25.285714
3           71          4  27.461538
6           72          3  19.000000
9           73          3  18.000000

My most advanced attempt consisted of this:

1. I converted the model_year and cylinders columns into a MultiIndex of the DataFrame.
2. Using (among others) the groupby method, I obtained a MultiIndex object containing the rows I'd like to select.

However, I couldn't find a way to select rows using that MultiIndex object.

For reference, the MultiIndex I obtained is:

MultiIndex([(70, 4),
            (71, 4),
            (72, 3),
            (73, 3),
            (74, 4),
            (75, 4),
            (76, 4),
            (77, 3),
            (78, 4),
            (79, 4),
            (80, 3),
            (81, 4),
            (82, 4)],
           names=['model_year', 'cylinders'])

CodePudding user response：

You could use groupby with idxmin to get the index label of the minimum-cylinders row in each model_year, and then select those rows from df with loc:

out = df.loc[df.groupby('model_year')['cylinders'].idxmin()]

Output:

    model_year  cylinders        mpg
0           70          4  25.285714
3           71          4  27.461538
6           72          3  19.000000
9           73          3  18.000000
13          74          4  27.800000
16          75          4  25.250000
19          76          4  26.766667
22          77          3  21.500000
26          78          4  29.576471
30          79          4  31.525000
34          80          3  23.700000
38          81          4  32.814286
41          82          4  32.071429
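
As for the original sticking point (selecting rows with a MultiIndex object you already have), here is a minimal sketch of two ways that should work. It uses only the rows for model years 70 to 73 from the question, and the names wanted, by_label and by_mask are made up for illustration:

```python
import pandas as pd

# Rows for model years 70-73, copied from the question.
df = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71, 71, 72, 72, 72, 73, 73, 73, 73],
    "cylinders": [4, 6, 8, 4, 6, 8, 3, 4, 8, 3, 4, 6, 8],
    "mpg": [25.285714, 20.5, 14.111111, 27.461538, 18.0, 13.428571,
            19.0, 23.428571, 13.615385, 18.0, 22.727273, 19.0, 13.2],
})

# Rebuild a MultiIndex of (model_year, minimal cylinders) pairs,
# standing in for the one shown in the question.
min_cyl = df.groupby("model_year")["cylinders"].min()
wanted = pd.MultiIndex.from_frame(min_cyl.reset_index())

# Option A: index the frame by both columns and select by label.
by_label = df.set_index(["model_year", "cylinders"]).loc[wanted].reset_index()

# Option B: keep the original index and build a boolean mask instead.
keys = pd.MultiIndex.from_frame(df[["model_year", "cylinders"]])
by_mask = df[keys.isin(wanted)]

print(by_mask)
```

Option B keeps the original row labels (0, 3, 6, 9), which makes it easier to compare against the idxmin result above.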

CodePudding user response：

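I think a simpler solution would actually be to use groupby with transform, which broadcasts each group's minimum back onto every row so that it can be compared with the cylinders column directly: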

selected = df[df['cylinders'] == df.groupby('model_year')['cylinders'].transform('min')]

Output:

>>> selected
    model_year  cylinders        mpg
0           70          4  25.285714
3           71          4  27.461538
6           72          3  19.000000
9           73          3  18.000000
13          74          4  27.800000
16          75          4  25.250000
19          76          4  26.766667
22          77          3  21.500000
26          78          4  29.576471
30          79          4  31.525000
34          80          3  23.700000
38          81          4  32.814286
41          82          4  32.071429

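(Note that if a group has several rows tied for the minimum, all of them will be included in the output, whereas idxmin returns only the first one per group. That does not happen in the data above, where each model_year and cylinders pair occurs only once.)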
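
To make the difference in tie handling concrete, here is a small sketch on made-up data (the tied frame below is hypothetical and not taken from the question), in which model_year 70 has two rows sharing the minimum:

```python
import pandas as pd

# Hypothetical data: model_year 70 has two rows tied at 4 cylinders.
tied = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71],
    "cylinders": [4, 4, 6, 4, 8],
    "mpg": [25.0, 26.5, 20.5, 27.4, 13.4],
})

# transform('min') returns a Series aligned with the original rows,
# so the comparison keeps every row that equals its group's minimum.
group_min = tied.groupby("model_year")["cylinders"].transform("min")
all_minima = tied[tied["cylinders"] == group_min]

# idxmin returns one label per group: the first row holding the minimum.
first_minimum = tied.loc[tied.groupby("model_year")["cylinders"].idxmin()]

print(all_minima.index.tolist())     # [0, 1, 3]
print(first_minimum.index.tolist())  # [0, 3]
```

Which behaviour you want depends on whether ties should produce several rows per model_year or exactly one.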

CodePudding user response：

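You can also sort by cylinders and keep the first row of each model_year. Sort in ascending order (the default), because drop_duplicates keeps the first occurrence; with ascending=False you would get the row with the most cylinders instead. The final sort_index restores the original row order:

out = df.sort_values('cylinders').drop_duplicates('model_year').sort_index()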
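
The direction of the sort matters here, because drop_duplicates keeps the first row it sees for each key by default. A quick sketch using only the rows for model years 72 and 73 (minus one) from the question; the names small, smallest and largest are just for illustration:

```python
import pandas as pd

# A few rows for model years 72 and 73 from the question.
small = pd.DataFrame({
    "model_year": [72, 72, 72, 73, 73],
    "cylinders": [3, 4, 8, 3, 4],
    "mpg": [19.0, 23.428571, 13.615385, 18.0, 22.727273],
})

# Ascending sort: the first row per model_year has the fewest cylinders.
smallest = (small.sort_values("cylinders")
                 .drop_duplicates("model_year")
                 .sort_index())

# Descending sort: the first row per model_year has the most cylinders.
largest = (small.sort_values("cylinders", ascending=False)
                .drop_duplicates("model_year")
                .sort_index())

print(smallest["cylinders"].tolist())  # [3, 3]
print(largest["cylinders"].tolist())   # [8, 4]
```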