Pandas: Get max value of a group ONLY if the value satisfies given conditions-CodePudding

I have a large df. The data is grouped by col, row, year, No, potveg and total. I am trying to get the max value of 'total' column in a specific year of a group ONLY if its 'Possible' value is TRUE. If the max 'total' value is FALSE, then get the second max value, and so on.

i.e., for the dataset below:

col   row   year    No  potveg  total   Possible
                        
-125 42.5   2015    1   9       697.3   FALSE
            2015    2   13      535.2   TRUE
            2015    3   15      82.3    TRUE
            2016    1   9       907.8   TRUE
            2016    2   13      137.6   FALSE
            2016    3   15      268.4   TRUE
            2017    1   9       961.9   FALSE
            2017    2   13      74.2    TRUE
            2017    3   15      248     TRUE
            2018    1   9       937.9   TRUE
            2018    2   13      575.6   TRUE
            2018    3   15      215.5   FALSE
-135 70.5   2015    1   8       697.3   FALSE
            2015    2   10      535.2   TRUE
            2015    3   19      82.3    TRUE
            2016    1   8       907.8   TRUE
            2016    2   10      137.6   FALSE
            2016    3   19      268.4   TRUE
            2017    1   8       961.9   FALSE
            2017    2   10      74.2    TRUE
            2017    3   19      248     TRUE
            2018    1   8       937.9   TRUE
            2018    2   10      575.6   TRUE
            2018    3   19      215.5   FALSE

The output would be:

col   row   year    No  potveg  total   Possible
                        
-125 42.5   2015    2   13      535.2   TRUE
            2016    1   9       907.8   TRUE
            2017    3   15      248     TRUE
            2018    1   9       937.9   TRUE
            
-135 70.5   2015    2   10      535.2   TRUE
            2016    1   8       907.8   TRUE
            2017    3   19      248     TRUE
            2018    1   8       937.9   TRUE

I tried:

df1 = pd.merge(df.groupby(['col','row','year']).agg({'total':'max'}).reset_index(), 
               df, 
               on=['col', 'row', 'year', 'total'])

But that only selects the max value

PS* If all the values of the 'possible' column in a specific year group = false, then I want to pick the max out of those so that I don't skip any years.

Thanks!

CodePudding user response：

You can do sort_values first then groupby wit head

out = df.sort_values(['Possible','total'], ascending = [False,False]).groupby(['col','row','year']).head(1)

CodePudding user response：

I have edited your dataset so the 2018 year has only false possible values, so I can demonstrate the method I've used below. The new dataset is as follows, in CSV format.

col,row,year,No,potveg,total,Possible
-125,42.5,2015,1,9,697.3,FALSE
,,2015,2,13,535.2,TRUE
,,2015,3,15,82.3,TRUE
,,2016,1,9,907.8,TRUE
,,2016,2,13,137.6,FALSE
,,2016,3,15,268.4,TRUE
,,2017,1,9,961.9,FALSE
,,2017,2,13,74.2,TRUE
,,2017,3,15,248,TRUE
,,2018,1,9,937.9,FALSE
,,2018,2,13,575.6,FALSE
,,2018,3,15,215.5,FALSE
-135,70.5,2015,1,8,697.3,FALSE
,,2015,2,10,535.2,TRUE
,,2015,3,19,82.3,TRUE
,,2016,1,8,907.8,TRUE
,,2016,2,10,137.6,FALSE
,,2016,3,19,268.4,TRUE
,,2017,1,8,961.9,FALSE
,,2017,2,10,74.2,TRUE
,,2017,3,19,248,TRUE
,,2018,1,8,937.9,FALSE
,,2018,2,10,575.6,FALSE
,,2018,3,19,215.5,FALSE

The next chunk of code should sum the totals, and will only preserve the maximum "FALSE" values in the "Possible" column if no corresponding "TRUE" values exist for that year.

import pandas as pd

df = pd.read_csv('test.csv')
# Forward fill the col and row columns so they have values in each row
df[['col', 'row']] = df[['col', 'row']].fillna(method='ffill')

# Separate out the true and false possibilities.
true_df = df[df['Possible'] == True]
false_df = df[df['Possible'] == False]

# Group the true and the false values by their max 'total' values.
truth_aggregated_df = pd.merge(true_df.groupby(['col', 'row', 'year']).agg({'total': 'max'}).reset_index(),
                               true_df,
                               on=['col', 'row', 'year', 'total'])
false_aggregated_df = pd.merge(false_df.groupby(['col', 'row', 'year']).agg({'total': 'max'}).reset_index(),
                               false_df,
                               on=['col', 'row', 'year', 'total'])

# Merge the true and the false dataframes on the col, row and year.
final_df = truth_aggregated_df.merge(false_aggregated_df, how='right', on=['col', 'row', 'year'], suffixes=('', '_false'))
# If there are null values in these four columns, it'll be because there exists a value in the false_df
# that doesn't exist in the true df.
final_df['total'].fillna(final_df['total_false'], inplace=True)
final_df['No'].fillna(final_df['No_false'], inplace=True)
final_df['potveg'].fillna(final_df['potveg_false'], inplace=True)
final_df['Possible'].fillna(final_df['Possible_false'], inplace=True)
# Once that is done, drop the extra columns from the merge (tagged with '_false' suffixes)
final_df = final_df.drop(final_df.filter(regex='\_false').columns, axis=1)
print(final_df)

The output of that print is:

 col   row  year  total   No  potveg  Possible
0 -135.0  70.5  2015  535.2  2.0    10.0      True
1 -135.0  70.5  2016  907.8  1.0     8.0      True
2 -135.0  70.5  2017  248.0  3.0    19.0      True
3 -135.0  70.5  2018  937.9  1.0     8.0     False
4 -125.0  42.5  2015  535.2  2.0    13.0      True
5 -125.0  42.5  2016  907.8  1.0     9.0      True
6 -125.0  42.5  2017  248.0  3.0    15.0      True
7 -125.0  42.5  2018  937.9  1.0     9.0     False