Group pandas df by every n rows with most frequent entry in column y for each set of n rows-CodePudding

I'm working in a Python / Mesa environment and producing agent-based model output data which looks like the following:

25920 rows by 11 columns. The df is sorted by the columns of which I've performed parametric runs producing all combinations of them: density, hero_prob, malevolent, tipping_point, agi_gain

The parameter space is 2592 and with 10 replications per combination...thus 25920 rows. What I am trying to do is create a new df for logistic regression which condenses every 10 rows of a given parameter combination with the most frequent entry for 'Engage' remains.

Thus, I'd end up a new 2592 x 11 df where the first row would be the same as the row with RunId value 2 above since all of those 'Engage' entries are False. A later parameter set with 7 True and 3 False would condense to a row with those parameter settings and True in the engage column for example. I tried the following code:

df_grouped = df_sorted.groupby(['density','hero_prob','malevolent','tipping_point','agi_gain'])['Engage'].agg(pd.Series.mode).to_frame()

But the results confuse me and I only have an Engage column, not the parameter settings. And agi_gain appears combined:

That structure isn't like my original df, which is what I'm trying to circle back to, but just condensed with the most common 'Engage' result.

Bonus: In the event of 5 True, 5 False for a given parametric combination an "Engage" entry of "Tie" would be amazing. Thank you as always to this amazing community.

CodePudding user response：

Since parameters in the group are all the same, it is possible to just use mode:

df_sorted.groupby(parameter_columns).agg(pd.Series.mode)

For the tie support, an aggregation function would look something like:

def tie_mode(series):
    counts = series.value_counts()
    if len(counts) == 1:  # a parameter column or all same results
        return next(iter(series))
    if counts.get(False) == counts.get(True):
        return 'tie'
    return counts.get(True, 0) > counts.get(False, 0)

df_sorted.groupby(parameter_columns).agg(my_agg)