Select rows by variable condition (i.e. desired value in each row depends on values in other rows)

Time:02-22

I have the following DataFrame:

    model_year  cylinders        mpg
0           70          4  25.285714
1           70          6  20.500000
2           70          8  14.111111
3           71          4  27.461538
4           71          6  18.000000
5           71          8  13.428571
6           72          3  19.000000
7           72          4  23.428571
8           72          8  13.615385
9           73          3  18.000000
10          73          4  22.727273
11          73          6  19.000000
12          73          8  13.200000
13          74          4  27.800000
14          74          6  17.857143
15          74          8  14.200000
16          75          4  25.250000
17          75          6  17.583333
18          75          8  15.666667
19          76          4  26.766667
20          76          6  20.000000
21          76          8  14.666667
22          77          3  21.500000
23          77          4  29.107143
24          77          6  19.500000
25          77          8  16.000000
26          78          4  29.576471
27          78          5  20.300000
28          78          6  19.066667
29          78          8  19.050000
30          79          4  31.525000
31          79          5  25.400000
32          79          6  22.950000
33          79          8  18.630000
34          80          3  23.700000
35          80          4  34.612000
36          80          5  36.400000
37          80          6  25.900000
38          81          4  32.814286
39          81          6  23.428571
40          81          8  26.600000
41          82          4  32.071429
42          82          6  28.333333

I want to select rows that fulfill the following condition: for each model_year, select the row with the minimal value of cylinders in that year.

So, for instance, for model years 70, 71, 72 and 73 I want to get:

    model_year  cylinders        mpg
0           70          4  25.285714
3           71          4  27.461538
6           72          3  19.000000
9           73          3  18.000000

My most advanced attempt consisted of this:

  1. I converted the model_year and cylinders columns into the MultiIndex of the DataFrame.
  2. Using (among others) the groupby method, I obtained a MultiIndex object of the rows I'd like to select.

However, I couldn't find a way to select rows using a MultiIndex object.

For reference, the MultiIndex I obtained is:

MultiIndex([(70, 4),
            (71, 4),
            (72, 3),
            (73, 3),
            (74, 4),
            (75, 4),
            (76, 4),
            (77, 3),
            (78, 4),
            (79, 4),
            (80, 3),
            (81, 4),
            (82, 4)],
           names=['model_year', 'cylinders'])
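
As a side note on the MultiIndex route itself: one way to select rows with such an object is a membership test against the (model_year, cylinders) pairs. This is only a minimal sketch on a three-year subset of the data above; the names `wanted`, `mask` and `selected_by_mi` are made up for illustration.

```python
import pandas as pd

# Subset of the question's data (model years 70-72 only).
df = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71, 71, 72, 72, 72],
    "cylinders": [4, 6, 8, 4, 6, 8, 3, 4, 8],
    "mpg": [25.285714, 20.5, 14.111111, 27.461538, 18.0,
            13.428571, 19.0, 23.428571, 13.615385],
})

# The (model_year, cylinders) pairs to keep, as in the MultiIndex above.
wanted = pd.MultiIndex.from_tuples(
    [(70, 4), (71, 4), (72, 3)], names=["model_year", "cylinders"]
)

# Build a MultiIndex from the two columns and test each row for membership.
mask = pd.MultiIndex.from_frame(df[["model_year", "cylinders"]]).isin(wanted)

# Boolean indexing keeps the original integer index of the matching rows.
selected_by_mi = df[mask]
print(selected_by_mi)
```

Because the original index is kept, the result can be compared directly with the expected rows 0, 3 and 6.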

CodePudding user response:

You could use groupby idxmin to get the index label of the minimum-cylinders row in each model_year, and then select those rows from df with loc:

out = df.loc[df.groupby('model_year')['cylinders'].idxmin()]

Output:

    model_year  cylinders        mpg
0           70          4  25.285714
3           71          4  27.461538
6           72          3  19.000000
9           73          3  18.000000
13          74          4  27.800000
16          75          4  25.250000
19          76          4  26.766667
22          77          3  21.500000
26          78          4  29.576471
30          79          4  31.525000
34          80          3  23.700000
38          81          4  32.814286
41          82          4  32.071429
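
To see what the intermediate step produces, here is a small sketch on a three-year subset of the question's data. The variable name `labels` is only for illustration.

```python
import pandas as pd

# Subset of the question's data (model years 70-72 only).
df = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71, 71, 72, 72, 72],
    "cylinders": [4, 6, 8, 4, 6, 8, 3, 4, 8],
    "mpg": [25.285714, 20.5, 14.111111, 27.461538, 18.0,
            13.428571, 19.0, 23.428571, 13.615385],
})

# For each model_year, idxmin gives the index label of the first row
# holding the smallest cylinders value. The result is a Series indexed
# by model_year, not a boolean mask.
labels = df.groupby("model_year")["cylinders"].idxmin()
print(labels)

# Passing those labels to .loc picks exactly one row per model_year.
out = df.loc[labels]
print(out)
```

Since idxmin yields a single label per group, this approach always returns one row per model_year.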

CodePudding user response:

I think a simpler solution would actually be to use groupby transform:

selected = df[df['cylinders'] == df.groupby('model_year')['cylinders'].transform('min')]

Output:

>>> selected
    model_year  cylinders        mpg
0           70          4  25.285714
3           71          4  27.461538
6           72          3  19.000000
9           73          3  18.000000
13          74          4  27.800000
16          75          4  25.250000
19          76          4  26.766667
22          77          3  21.500000
26          78          4  29.576471
30          79          4  31.525000
34          80          3  23.700000
38          81          4  32.814286
41          82          4  32.071429

(Note that if a group has several rows tied for the minimum, e.g. if model_year 70 had two 4-cylinder rows, all of them will be included in the output.)
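
The behaviour on ties can be checked with a small made-up frame. The data below is hypothetical (not from the question), with model_year 70 deliberately given two 4-cylinder rows; the idxmin variant is shown next to it only for comparison.

```python
import pandas as pd

# Hypothetical data: model_year 70 has two rows tied at 4 cylinders.
ties = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71],
    "cylinders": [4, 4, 6, 6, 8],
    "mpg": [25.0, 24.0, 20.0, 18.0, 13.0],
})

# transform('min') broadcasts each group's minimum back onto every row,
# so the comparison keeps every row that equals its group's minimum.
group_min = ties.groupby("model_year")["cylinders"].transform("min")
all_minima = ties[ties["cylinders"] == group_min]
print(all_minima)

# idxmin returns only the first label per group, so ties are dropped.
first_minimum = ties.loc[ties.groupby("model_year")["cylinders"].idxmin()]
print(first_minimum)
```

Which of the two is preferable depends on whether tied rows should all be kept or only one row per group is wanted.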

CodePudding user response:

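You can also just sort by cylinders (ascending, so the smallest value comes first) and keep the first row for each model_year:

out = df.sort_values('cylinders').drop_duplicates('model_year').sort_index()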
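
A quick check of the sort-then-deduplicate idea on a three-year subset of the question's data. This sketch assumes an ascending sort, since drop_duplicates keeps the first row it sees per model_year; the descending variant is included only to show that the sort direction matters.

```python
import pandas as pd

# Subset of the question's data (model years 70-72 only).
df = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71, 71, 72, 72, 72],
    "cylinders": [4, 6, 8, 4, 6, 8, 3, 4, 8],
    "mpg": [25.285714, 20.5, 14.111111, 27.461538, 18.0,
            13.428571, 19.0, 23.428571, 13.615385],
})

# Ascending sort puts the smallest cylinders first, so the first row kept
# per model_year is the minimum. sort_index restores the original order.
out = df.sort_values("cylinders").drop_duplicates("model_year").sort_index()
print(out)

# With a descending sort the same chain keeps the maximum instead.
descending = (
    df.sort_values("cylinders", ascending=False)
      .drop_duplicates("model_year")
      .sort_index()
)
print(descending)
```

Like idxmin, this keeps a single row per model_year even when several rows are tied for the minimum.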