Get the percentage of each element in a group in a dataframe-CodePudding

Given a dataframe like this:

id  column1   column2   column3   column4    columns5
1      a         b        a          b           x
2      b         a        c          b           y
3      c         b        a          c           x

Considering these groups:

group_1 = [a,b,c]
group_2 = [x,y,z]

How can I get this output:

%_a        %_b       %_c     %_x        %_y       %_z
0.3333   0.4167     0.25     0.667      0.333      0

because:

% of elements of group_1:  a=4/12, b=5/12, c=3/12
% of elements of group_2:  x=2/3, y=1/3, z=0

CodePudding user response：

Here's an approach with df.stack:

group_1 = ['a','b','c']
group_2 = ['x','y','z']

s = df.stack()

res = (pd
       .concat(
           [s[s.isin(g)].value_counts(normalize=True)
            .reindex(g).fillna(0) 
            for g in [group_1, group_2]]
           )
       .to_frame().T.add_prefix('%_')
       )

res

        %_a       %_b   %_c       %_x       %_y  %_z
0  0.333333  0.416667  0.25  0.666667  0.333333  0.0

Explanation

First, we use df.stack to turn the df into a pd.Series.
Now, for each group, we want to get only the values that belong to that particular group (using Series.isin), and to this selection we apply Series.value_counts with normalize parameter set to True.
Next, we chain Series.reindex to re-add any values from the group that are not present in the selection (i.e. z from group_1); we supply the value 0 for any such missing values by using Series.fillna.
We use a list comprehension to apply this logic to both groups and use the result as input for pd.concat to get back one Series.
Finally, we turn the Series into a df (Series.to_frame), transpose it (df.T), and add a prefix to the columns (df.add_prefix).

CodePudding user response：

You can use boolean masks for this. For example:

df.isin(group_1)

Will give you a boolean mask where the values of the DataFrame that are in the list group_1 are True and the rest are False. Then counting the number of values that are true is as simple as using the method .sum() twice or .values.sum() (ie. convert to numpy array and then sum the true values).

On the same way, you can iterate through the elements of the group and use a boolean mask to filter individual values eg.

for i in group_1:
    print(f"Number of {i}= {(df==i).values.sum()}")

For the solution I would put the groups in a dictionary and then iterate through the dictionary:

groups = {'group1': group_1, 'group2': group_2 }
results = {}
for group_name, group in groups.items():
    count_group = df.isin(group).values.sum()
    results[group_name] = {}
    for val in group:
        count_value = (df==val).values.sum()
        results[group_name][val] = count_value/count_group
print(results)

CodePudding user response：

Here is a proposition with melt & value_counts :

g1 = df.isin(group_1).sum().sum()
g2 = df.isin(group_2).sum().sum()

d = {**{k:g1 for k in group_1}, **{k:g2 for k in group_2}}

out = (
        df
          .set_index("id").melt()
           ["value"].value_counts().to_frame()
          .join(pd.Series(d).rename("c"), how="outer").fillna(0)
          .assign(temp_col= lambda x: x.pop("value").div(x.pop("c")))
          .T.sort_index(axis=1).add_prefix("%_").reset_index(drop=True)

      )

Output :

print(out)

        %_a       %_b   %_c       %_x       %_y  %_z
0  0.333333  0.416667  0.25  0.666667  0.333333  0.0

CodePudding user response：

In case, elements of group_1 and group_2 can not coexist in a same column, a possible solution is, which is based on numpy:

def get_percs(df, group, cols):
    size = df.iloc[:,cols].values.size
    return np.sum(
        np.sum(df.iloc[:,cols].values ==
               np.array(group)[:,None][:,None], axis=2),
        axis=1) / size

get_percs(df, group_1, range(1,5)), get_percs(df, group_2, range(5,6))

Output:

(array([0.33333333, 0.41666667, 0.25      ]),
 array([0.66666667, 0.33333333, 0.        ]))

If the output is required to be a dataframe:

pd.concat([pd.DataFrame(get_percs(df, g[0], g[1]).reshape(1,-1), 
                        columns=[f'%_{x}' for x in g[0]]) 
           for g in zip([group_1, group_2], [range(1,5), range(5,6)])], axis=1)

Output:

        %_a       %_b   %_c       %_x       %_y  %_z
0  0.333333  0.416667  0.25  0.666667  0.333333  0.0