I have a pandas dataframe with following columns
col1 col2 col3
x 12 abc
x 7 abc
x 5 abc
x 3
y 10 abc
y 9 abc
I would like to find all rows in a pandas DataFrame which have the max value for col2 column, after grouping by 'col1' columns after filtering the rows where col3 is null?
The expected output is:
col1 col2 col3
x 12 abc
y 10 abc
I have tried the below code so far.
df[df[['col3']].notnull().all(1) & df.sort_values('col2').drop_duplicates(['col1'], keep='last')]
However I am getting following error.
TypeError: unsupported operand type(s) for &: 'bool' and 'float'
Any help is highly appreciated
CodePudding user response:
How
max
method calculate without mentioning the column?
According to the pd.DataFrame.max
it returns the maximum values over the selected axis with default being (0, index).
In your example you only have 1 numeric values and all values in col3 are identical. If col3 was also numeric, max
method would return the max values of that column as well with the resulting DataFrame may have different rows than the original one.
It is suitable in this case but if you only would like the output DataFrame have the rows as the original one you need to be specific about the column whose maximum you would like to consider.
df.loc[df.notnull().all(axis=1)].groupby('col1').max().reset_index()
col1 col2 col3
0 x 12 abc
1 y 10 abc
Or you can create a boolean Series first and assign it to a name to improve readability:
m = df.notnull().all(axis=1)
df.loc[m].groupby('col1').max().reset_index()
Let's say now this is your original DataFrame:
col1 col2 col3
0 x 12 2.0
1 x 7 20.0
2 x 5 1.0
3 x 3 NaN
4 y 10 4.0
5 y 9 11.0
When you apply max
on this without specifying the column name it will return the following:
col1 col2 col3
0 x 12 20.0
1 y 10 11.0
CodePudding user response:
another solution
import pandas as pd
lstColumns=["col1","col2","col3"]
lstValues=[["x",12,"abc"],["x",7,"abc"],["x",5,"abc"],["x",3,"abc"],["y",10,"abc"],["y",9,"abc"]]
df=pd.DataFrame(lstValues,columns=lstColumns)
df=df.sort_values(['col1', 'col2'], ascending=[True, True])
newdf=df.drop_duplicates(subset='col1', keep="last")
col1 col2 col3
0 x 12 abc
4 y 10 abc