I have a large df. The data is grouped by col, row, year, No, potveg and total. I am trying to get the max value of 'total' column in a specific year of a group ONLY if its 'Possible' value is TRUE. If the max 'total' value is FALSE, then get the second max value, and so on.
i.e., for the dataset below:
col row year No potveg total Possible
-125 42.5 2015 1 9 697.3 FALSE
2015 2 13 535.2 TRUE
2015 3 15 82.3 TRUE
2016 1 9 907.8 TRUE
2016 2 13 137.6 FALSE
2016 3 15 268.4 TRUE
2017 1 9 961.9 FALSE
2017 2 13 74.2 TRUE
2017 3 15 248 TRUE
2018 1 9 937.9 TRUE
2018 2 13 575.6 TRUE
2018 3 15 215.5 FALSE
-135 70.5 2015 1 8 697.3 FALSE
2015 2 10 535.2 TRUE
2015 3 19 82.3 TRUE
2016 1 8 907.8 TRUE
2016 2 10 137.6 FALSE
2016 3 19 268.4 TRUE
2017 1 8 961.9 FALSE
2017 2 10 74.2 TRUE
2017 3 19 248 TRUE
2018 1 8 937.9 TRUE
2018 2 10 575.6 TRUE
2018 3 19 215.5 FALSE
The output would be:
col row year No potveg total Possible
-125 42.5 2015 2 13 535.2 TRUE
2016 1 9 907.8 TRUE
2017 3 15 248 TRUE
2018 1 9 937.9 TRUE
-135 70.5 2015 2 10 535.2 TRUE
2016 1 8 907.8 TRUE
2017 3 19 248 TRUE
2018 1 8 937.9 TRUE
I tried:
df1 = pd.merge(df.groupby(['col','row','year']).agg({'total':'max'}).reset_index(),
df,
on=['col', 'row', 'year', 'total'])
But that only selects the max value
PS* If all the values of the 'possible' column in a specific year group = false, then I want to pick the max out of those so that I don't skip any years.
Thanks!
CodePudding user response:
You can do sort_values
first then groupby
wit head
out = df.sort_values(['Possible','total'], ascending = [False,False]).groupby(['col','row','year']).head(1)
CodePudding user response:
I have edited your dataset so the 2018 year has only false possible values, so I can demonstrate the method I've used below. The new dataset is as follows, in CSV format.
col,row,year,No,potveg,total,Possible
-125,42.5,2015,1,9,697.3,FALSE
,,2015,2,13,535.2,TRUE
,,2015,3,15,82.3,TRUE
,,2016,1,9,907.8,TRUE
,,2016,2,13,137.6,FALSE
,,2016,3,15,268.4,TRUE
,,2017,1,9,961.9,FALSE
,,2017,2,13,74.2,TRUE
,,2017,3,15,248,TRUE
,,2018,1,9,937.9,FALSE
,,2018,2,13,575.6,FALSE
,,2018,3,15,215.5,FALSE
-135,70.5,2015,1,8,697.3,FALSE
,,2015,2,10,535.2,TRUE
,,2015,3,19,82.3,TRUE
,,2016,1,8,907.8,TRUE
,,2016,2,10,137.6,FALSE
,,2016,3,19,268.4,TRUE
,,2017,1,8,961.9,FALSE
,,2017,2,10,74.2,TRUE
,,2017,3,19,248,TRUE
,,2018,1,8,937.9,FALSE
,,2018,2,10,575.6,FALSE
,,2018,3,19,215.5,FALSE
The next chunk of code should sum the totals, and will only preserve the maximum "FALSE" values in the "Possible" column if no corresponding "TRUE" values exist for that year.
import pandas as pd
df = pd.read_csv('test.csv')
# Forward fill the col and row columns so they have values in each row
df[['col', 'row']] = df[['col', 'row']].fillna(method='ffill')
# Separate out the true and false possibilities.
true_df = df[df['Possible'] == True]
false_df = df[df['Possible'] == False]
# Group the true and the false values by their max 'total' values.
truth_aggregated_df = pd.merge(true_df.groupby(['col', 'row', 'year']).agg({'total': 'max'}).reset_index(),
true_df,
on=['col', 'row', 'year', 'total'])
false_aggregated_df = pd.merge(false_df.groupby(['col', 'row', 'year']).agg({'total': 'max'}).reset_index(),
false_df,
on=['col', 'row', 'year', 'total'])
# Merge the true and the false dataframes on the col, row and year.
final_df = truth_aggregated_df.merge(false_aggregated_df, how='right', on=['col', 'row', 'year'], suffixes=('', '_false'))
# If there are null values in these four columns, it'll be because there exists a value in the false_df
# that doesn't exist in the true df.
final_df['total'].fillna(final_df['total_false'], inplace=True)
final_df['No'].fillna(final_df['No_false'], inplace=True)
final_df['potveg'].fillna(final_df['potveg_false'], inplace=True)
final_df['Possible'].fillna(final_df['Possible_false'], inplace=True)
# Once that is done, drop the extra columns from the merge (tagged with '_false' suffixes)
final_df = final_df.drop(final_df.filter(regex='\_false').columns, axis=1)
print(final_df)
The output of that print is:
col row year total No potveg Possible
0 -135.0 70.5 2015 535.2 2.0 10.0 True
1 -135.0 70.5 2016 907.8 1.0 8.0 True
2 -135.0 70.5 2017 248.0 3.0 19.0 True
3 -135.0 70.5 2018 937.9 1.0 8.0 False
4 -125.0 42.5 2015 535.2 2.0 13.0 True
5 -125.0 42.5 2016 907.8 1.0 9.0 True
6 -125.0 42.5 2017 248.0 3.0 15.0 True
7 -125.0 42.5 2018 937.9 1.0 9.0 False