Sorting a dataset based on 2 columns & computing averages of sub-datasets based on the 2 columns


I have a data set that details polling data in different states and the percentage of people who voted for either DEM or REP in that state. Here is what my data frame looks like:
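Roughly, each row is one poll result for one party in one state; a minimal sketch (the column names state, party and pct are just placeholders, the real names may differ) would be:

import pandas as pd

# hypothetical polling rows -- the real column names and values may differ
df = pd.DataFrame({
    "state": ["New Hampshire", "New Hampshire", "Maine", "Maine"],
    "party": ["DEM", "REP", "DEM", "REP"],
    "pct":   [57.0, 43.0, 44.0, 56.0],
})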

I'm essentially trying to find the average percentage of people in X state voting for either DEM or REP. So my output would be something like:

New Hampshire | DEM | 55%
New Hampshire | REP | 45%
Maine | DEM | 45%
Maine | REP | 54%
etc.

I initially thought of simply iterating over the entire dataset and assigning a new pct variable for each state's DEM or REP percentage, but that felt inefficient.

I'm thinking of sorting the data so that it reads state1, DEM | state1, REP | state2, DEM | state2, REP, etc., and then finding the averages. But I'm not very experienced with pandas (which is what I'm attempting to use), so perhaps someone can point me in the right direction.

CodePudding user response:

IIUC, use pandas.concat with GroupBy.mean:

cols = ["state", "party"]

(
    pd.concat([df_house, df_senate],
              ignore_index=True)
        .groupby(cols, as_index=False)
        .mean(numeric_only=True)
        .sort_values(by=cols)
)

This returns a pandas.core.frame.DataFrame that you can assign to a variable:

df_average = pd.concat([df_house, df_senate], ignore_index=True).groupby(cols, as_index=False).mean(numeric_only=True).sort_values(by=cols)
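For example, with two small made-up frames (the house/senate names and the numbers below are just for illustration), the chained version produces per-state, per-party averages:

import pandas as pd

# made-up sample data; the real frames will have more rows and columns
df_house = pd.DataFrame({
    "state": ["Maine", "Maine", "New Hampshire", "New Hampshire"],
    "party": ["DEM", "REP", "DEM", "REP"],
    "pct":   [44.0, 56.0, 57.0, 43.0],
})
df_senate = pd.DataFrame({
    "state": ["Maine", "Maine", "New Hampshire", "New Hampshire"],
    "party": ["DEM", "REP", "DEM", "REP"],
    "pct":   [46.0, 52.0, 53.0, 47.0],
})

cols = ["state", "party"]
df_average = (
    pd.concat([df_house, df_senate], ignore_index=True)
      .groupby(cols, as_index=False)
      .mean(numeric_only=True)
      .sort_values(by=cols)
)
print(df_average)
#            state party   pct
# 0          Maine   DEM  45.0
# 1          Maine   REP  54.0
# 2  New Hampshire   DEM  55.0
# 3  New Hampshire   REP  45.0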

CodePudding user response:

try using df.groupby(['state','party'])['pct'].mean()
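Note that this returns a Series indexed by (state, party); if you want a flat frame like the sample output, tack on reset_index() (assuming the percentage column really is named pct):

# reset_index() turns the MultiIndex Series back into a regular DataFrame
df.groupby(['state', 'party'])['pct'].mean().reset_index()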
