Data filtration-CodePudding

I have the data of below type:

df = pd.DataFrame({'Election Yr.':[2000,2000,2000,2000,2000,2000,2005,2005,2005,2005,2005,2005],
                   'Party':['A','A','B','B','C','C','A','A','B','B','C','C',],
                   'Votes':[50,30,40,50,30,40,50,30,40,50,30,40],
                   'Odd':['aa','bb','cc','dd','ee','gg','ff', 'hh', 'jj', 'kk', 'll', 'yy']})

I want to groupby Election Yr. and sum the values in Votes, while keeping the values for Party that correspond to max value of Votes for each unique Election Yr. and first value for Odd.

I came up with a partial solution like below:

df = pd.DataFrame(df.groupby(['Election Yr.'], as_index=False).agg({'Votes':'sum', 'Party':'idxmax', 'Odd':'first'}))

But the Party:'idxmax' seems to be incorrect since the values are strings

CodePudding user response：

def get_max_votes_info(group):
    # Find the row with the maximum votes
    max_row = group[group['Votes'] == group['Votes'].max()].iloc[0]
    # Return a tuple with the sum of the votes and the party and odd values for the row with the maximum votes
    return (group['Votes'].sum(), max_row['Party'], max_row['Odd'])

# Group the DataFrame by 'Election Yr.' and apply the custom function
df_grouped = df.groupby('Election Yr.').apply(get_max_votes_info)

# The resulting DataFrame has a multi-index, with the election year as the first level and the party and odd values as the second level
# You can reset the index to remove the multi-index if desired
df_result = df_grouped.reset_index()

# The resulting DataFrame should have three columns: 'Election Yr.', 'Votes', and 'Party', with the values you specified
print(df_result)

or try:

df_grouped = df.groupby('Election Yr.').agg({
    'Votes': 'sum',
    'Party': lambda x: x.loc[x.idxmax()],  # returns the value of 'Party' for the row with the maximum votes
    'Odd': 'first'  # returns the first value of 'Odd'
})

# The resulting DataFrame has a single index, with the election year as the only level
# You can reset the index to convert the index to a column if desired
df_result = df_grouped.reset_index()

# The resulting DataFrame should have three columns: 'Election Yr.', 'Votes', and 'Party', with the values you specified
print(df_result)

or you can use it to apply the max function to the 'Votes' column, and then use the resulting index to select the corresponding value of the 'Party' column:

def get_max_votes_info(group):
    # Find the index of the row with the maximum votes
    max_index = group['Votes'].idxmax()
    # Return a tuple with the sum of the votes and the party and odd values for the row with the maximum votes
    return (group['Votes'].sum(), group.loc[max_index, 'Party'], group.loc[max_index, 'Odd'])

df_grouped = df.groupby('Election Yr.').apply(get_max_votes_info)
df_result = df_grouped.reset_index()

# The resulting DataFrame should have three columns: 'Election Yr.', 'Votes', and 'Party', with the values you specified
print(df_result)

CodePudding user response：

You can create two dataframes:

With sum of Votes and first value of Odd, per year
With the party that matches the maximum value of the years (can be multiple)

df_sum_odd = df.groupby(['Election Yr.'], as_index=False).agg({'Votes':'sum', 'Odd':'first'})
df_party = df[df.groupby('Election Yr.').Votes.transform('max') == df.Votes][['Election Yr.', 'Party']]
df_sum_odd.merge(df_party, on = 'Election Yr.', how='right')

After merging those two you will receive the following result:

    Election Yr.    Votes   Odd Party
0   2000            240     aa  A
1   2000            240     aa  B
2   2005            240     ff  A
3   2005            240     ff  B

Note that there are two lines per election year, since there are two different parties that correspond with the max Votes value per year (50).