Ensure rounded percentages sum up to 100 per group (largest remainder method)

How can I update a column of weights, grouped by a unique name, in Pandas using the 'largest remainder method'? I want the weights in each group to add up to 100% after they are rounded to 2 decimal places.

Input dataframe:

print(df)
     Name    Weight
0    John    33.3333
1    John    33.3333
2    John    33.3333
3    James   50
4    James   25
5    James   25
6    Kim     6.6666
5    Kim     93.3333
6    Jane    46.6666
7    Jane    6.6666
8    Jane    46.6666

Expected results:

print(df)
     Name    Weight   New Weight
0    John    33.3333  33.33
1    John    33.3333  33.33
2    John    33.3333  33.34
3    James   50       50
4    James   25       25
5    James   25       25
6    Kim     6.6666   6.66
5    Kim     93.3333  93.34
6    Jane    46.6666  46.66
7    Jane    6.6666   6.67
8    Jane    46.6666  46.67

I've tried to apply the following functions:

Python Percentage Rounding

def round_to_100_percent(number_set, digit_after_decimal=2):
    """
        Take a list of numbers and return a list of percentages, each representing that number's share of the sum of all numbers.
        The percentages add up to exactly 100%.
        Note: the algorithm used here is 'Largest Remainder'.
        The downside is that the results are not exact, but rounded values never are.
    """
    unround_numbers = [x / float(sum(number_set)) * 100 * 10 ** digit_after_decimal for x in number_set]
    decimal_part_with_index = sorted([(index, unround_numbers[index] % 1) for index in range(len(unround_numbers))], key=lambda y: y[1], reverse=True)
    remainder = 100 * 10 ** digit_after_decimal - sum([int(x) for x in unround_numbers])
    index = 0
    while remainder > 0:
        unround_numbers[decimal_part_with_index[index][0]] += 1
        remainder -= 1
        index = (index + 1) % len(number_set)
    return [int(x) / float(10 ** digit_after_decimal) for x in unround_numbers]

Split (explode) pandas dataframe string entry to separate rows

def explode(df, lst_cols, fill_value='', preserve_index=False):
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    # create "exploded" DF
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        res = (pd.concat([res, df.loc[lens==0, idx_cols]], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:        
        res = res.reset_index(drop=True)
    return res

This is what I tried so far:

new_column = df.groupby('Name')['Weight'].apply(round_to_100_percent)

#Merge new_column into main data frame
df = pd.merge(df, new_column, on='Name', how='outer')

#For some reason _y is added to col
df = df.explode('Weight_y')

df['New Weight'] = df['Weight_y']*0.01

This fails in a couple of ways. Sometimes the result has more rows than the original dataframe, and I'm not sure why a Weight_y column is being created.

Is there a better way to apply the largest remainder rounding to a Pandas column?

User response:

Here is a simple approach: round every weight to 2 decimals, then replace the last item of each group with whatever is needed to reach 100, which adds the missing amount or removes the excess (you can adjust a different item if you like):

df['rounded'] = (df['Weight']
 .round(2)
 .groupby(df['Name'])
 .transform(lambda s: pd.Series({s.index[-1]: (100-s.iloc[:-1].sum()).round(2)})
                        .combine_first(s))
)

output:

    Name   Weight  rounded
0   John  33.3333    33.33
1   John  33.3333    33.33
2   John  33.3333    33.34
3  James  50.0000    50.00
4  James  25.0000    25.00
5  James  25.0000    25.00
6    Kim   6.6666     6.67
5    Kim  93.3333    93.33
6   Jane  46.6666    46.67
7   Jane   6.6666     6.67
8   Jane  46.6666    46.66

Checking the sum:

df.groupby('Name')['rounded'].sum()

James    100.0
Jane     100.0
John     100.0
Kim      100.0
Name: rounded, dtype: float64
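
The approach above pushes the whole rounding difference onto one row per group, which is not quite the largest remainder method named in the question. As a rough sketch of that method, you could floor every weight in units of 0.01 and hand the missing units to the rows with the largest fractional parts. The function name largest_remainder and its decimals parameter are made up for illustration, and the sketch assumes non-negative weights whose sum per group is positive. Using transform returns one value per row, which sidesteps the merge/explode step from the question, where every row received the whole group's list and explode then multiplied the rows:

```python
import numpy as np
import pandas as pd


def largest_remainder(weights: pd.Series, decimals: int = 2) -> pd.Series:
    """Round one group's weights so they sum to exactly 100 (hypothetical helper).

    Assumes non-negative weights with a positive sum.
    """
    scale = 10 ** decimals
    # Rescale the group to 100 and express it in units of the smallest step
    # (0.01 when decimals=2).
    units = weights.to_numpy(dtype=float) / weights.sum() * 100 * scale
    floors = np.floor(units).astype(int)
    # How many single units are still missing after flooring every value.
    shortfall = int(100 * scale - floors.sum())
    # Positions sorted by descending fractional part; the stable sort keeps
    # the original row order when fractional parts are tied.
    order = np.argsort(-(units - floors), kind="stable")
    floors[order[:shortfall]] += 1
    return pd.Series(floors / scale, index=weights.index)


df = pd.DataFrame({
    "Name": ["John"] * 3 + ["James"] * 3 + ["Kim"] * 2 + ["Jane"] * 3,
    "Weight": [33.3333, 33.3333, 33.3333, 50, 25, 25,
               6.6666, 93.3333, 46.6666, 6.6666, 46.6666],
})

# transform calls the helper once per group and aligns the result by index.
df["New Weight"] = df.groupby("Name")["Weight"].transform(largest_remainder)
print(df)
print(df.groupby("Name")["New Weight"].sum().round(2))
```

Ties are broken by row order in this sketch, so for John the extra 0.01 lands on the first row rather than the last. With the sample data, Kim should come out as 6.67 and 93.33, and Jane as 46.67, 6.66 and 46.67.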