Using groupby() on a pandas DataFrame raised an IndexError

I have this dataframe:

      x        y        z        parameter     
0     26       24       25       Age
1     35       37       36       Age  
2     57       52       54.5     Age
3     160      164      162      Hgt           
4     182      163      172.5    Hgt             
5     175      167      171      Hgt              
6     95       71       83       Wgt     
7     110      68       89       Wgt     
8     89       65       77       Wgt

I want to use pandas to get this final result:

      x        y        parameter     
0     160      164      Hgt           
1     182      163      Hgt             
2     175      167      Hgt

I'm using groupby() to extract and isolate the rows that share the parameter value Hgt from the original dataframe.

First, I added a column to set it as an index:

df = df.insert(0,'index', [count for count in range(df.shape[0])], True)

And the dataframe came out like this:

      index    x        y        z        parameter     
0     0        26       24       25       Age
1     1        35       37       36       Age  
2     2        57       52       54.5     Age
3     3        160      164      162      Hgt           
4     4        182      163      172.5    Hgt             
5     5        175      167      171      Hgt              
6     6        95       71       83       Wgt     
7     7        110      68       89       Wgt     
8     8        89       65       77       Wgt

Then, I used the following code to group based on index and extract the columns I need:

df1 = df.groupby('index')[['x', 'y','parameter']]

And the output was:

      x        y        parameter     
0     26       24       Age
1     35       37       Age  
2     57       52       Age
3     160      164      Hgt           
4     182      163      Hgt             
5     175      167      Hgt              
6     95       71       Wgt     
7     110      68       Wgt     
8     89       65       Wgt

After that, I used the following code to isolate only Hgt values:

df2 = df1[df1['parameter'] == 'Hgt']

When I ran df2, I got an error saying:

IndexError: Column(s) ['x', 'y', 'parameter'] already selected

Am I missing something here? What should I do to get the final result?

CodePudding user response：

Do you really need groupby?

>>> df.loc[df['parameter'] == 'Hgt', ['x', 'y', 'parameter']].reset_index(drop=True)
     x    y parameter
0  160  164       Hgt
1  182  163       Hgt
2  175  167       Hgt
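
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the frame from the question (the values are copied from the table above; the variable names mask and result are just illustrative) and applies the same boolean-mask selection:

```python
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame({
    "x": [26, 35, 57, 160, 182, 175, 95, 110, 89],
    "y": [24, 37, 52, 164, 163, 167, 71, 68, 65],
    "z": [25, 36, 54.5, 162, 172.5, 171, 83, 89, 77],
    "parameter": ["Age"] * 3 + ["Hgt"] * 3 + ["Wgt"] * 3,
})

# The boolean mask picks the rows, the list picks the columns.
mask = df["parameter"] == "Hgt"
result = df.loc[mask, ["x", "y", "parameter"]].reset_index(drop=True)
print(result)
```

reset_index(drop=True) only renumbers the rows from 0; drop it if you would rather keep the original labels 3, 4, 5.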

CodePudding user response：

Because you asked what you did wrong, let me point out the unnecessary or incorrect code.

Without any judgement (this is just to help you improve future code), almost everything is incorrect. It feels like a succession of complicated ways to do useless things. Let me give some details:

df = df.insert(0,'index', [count for count in range(df.shape[0])], True)

This seems a very convoluted way to do df.reset_index(). Even [count for count in range(df.shape[0])] could have been simplified by using range(df.shape[0]) directly.
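
One more detail worth checking on that line: DataFrame.insert modifies the frame in place and returns None, so assigning its result back to df would leave df set to None. A small sketch on a shortened, made-up version of the frame (three rows only, to keep it brief):

```python
import pandas as pd

df = pd.DataFrame({"x": [26, 160, 95], "parameter": ["Age", "Hgt", "Wgt"]})

# insert() mutates df and returns None, so do not assign its result.
returned = df.insert(0, "index", range(len(df)))
print(returned)  # None

# reset_index() on a fresh copy of the same data gives the same leading column.
fresh = pd.DataFrame({"x": [26, 160, 95], "parameter": ["Age", "Hgt", "Wgt"]})
via_reset = fresh.reset_index()
print(via_reset.columns.tolist())
```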

But this step is not even needed for a groupby as you can group by index level:

df.groupby(level=0)

But... the groupby is useless anyway, as you only have single-member groups.
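
To see why grouping by a unique index achieves nothing, here is a quick check on the question's data (only x and parameter are kept, for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "x": [26, 35, 57, 160, 182, 175, 95, 110, 89],
    "parameter": ["Age"] * 3 + ["Hgt"] * 3 + ["Wgt"] * 3,
})

grouped = df.groupby(level=0)

# Every index label is unique, so there are as many groups as rows...
print(grouped.ngroups, len(df))

# ...and no group holds more than one row.
print(grouped.size().max())
```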

Also, when you do:

df1 = df.groupby('index')[['x', 'y','parameter']]

df1 is not a DataFrame but a DataFrameGroupBy object. Storing one in a variable is very useful when you know what you're doing, but it causes the error in your case because you treated it as a DataFrame. You need to call an aggregation or transformation method on the DataFrameGroupBy object to get a DataFrame back, which you didn't (likely because, as seen above, there isn't much of interest to do with single-member groups).
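
If you do want to go through groupby, a sketch of one way it could look is to group by the parameter column instead of a row counter, then ask the groupby object for a DataFrame. This is an alternative I am suggesting, not what the question's code does, and the names grouped, hgt and means are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [26, 35, 57, 160, 182, 175, 95, 110, 89],
    "y": [24, 37, 52, 164, 163, 167, 71, 68, 65],
    "parameter": ["Age"] * 3 + ["Hgt"] * 3 + ["Wgt"] * 3,
})

grouped = df.groupby("parameter")
print(type(grouped).__name__)  # DataFrameGroupBy

# get_group() hands back the rows of one group as a plain DataFrame.
hgt = grouped.get_group("Hgt")[["x", "y", "parameter"]].reset_index(drop=True)
print(hgt)

# An aggregation also returns a DataFrame, with one row per group.
means = grouped[["x", "y"]].mean()
print(means)
```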

So when you run:

df1[df1['parameter'] == 'Hgt']

again, this goes wrong, because df1['parameter'] is equivalent to df.groupby('index')[['x', 'y','parameter']]['parameter'], and that is the cause of the error: you are selecting columns a second time on an object whose columns were already selected. Even if you got past this error, the equality comparison would give a single True/False, since you would still have a groupby object and not a DataFrame, and the outer df1[...] would then incorrectly try to select a nonexistent column from the DataFrameGroupBy.
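
The double selection can be reproduced in isolation. This is a minimal sketch that groups by the index level instead of an inserted 'index' column (an assumption made to keep it short); the exception type and wording are the ones quoted in the question and may differ in other pandas versions:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [26, 160, 95],
    "y": [24, 164, 71],
    "parameter": ["Age", "Hgt", "Wgt"],
})

# Columns are selected once here; df1 is still a groupby object.
df1 = df.groupby(level=0)[["x", "y", "parameter"]]

try:
    df1["parameter"]  # second column selection on the same object
    message = None
except IndexError as exc:
    message = str(exc)

print(message)
```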

I hope it helped!