Given a dataframe like this:
id column1 column2 column3 column4 columns5
1 a b a b x
2 b a c b y
3 c b a c x
Considering these groups:
group_1 = [a,b,c]
group_2 = [x,y,z]
How can I get this output:
%_a %_b %_c %_x %_y %_z
0.3333 0.4167 0.25 0.667 0.333 0
because:
% of elements of group_1: a=4/12, b=5/12, c=3/12
% of elements of group_2: x=2/3, y=1/3, z=0
CodePudding user response:
Here's an approach with df.stack
:
group_1 = ['a','b','c']
group_2 = ['x','y','z']
s = df.stack()
res = (pd
.concat(
[s[s.isin(g)].value_counts(normalize=True)
.reindex(g).fillna(0)
for g in [group_1, group_2]]
)
.to_frame().T.add_prefix('%_')
)
res
%_a %_b %_c %_x %_y %_z
0 0.333333 0.416667 0.25 0.666667 0.333333 0.0
Explanation
- First, we use
df.stack
to turn thedf
into apd.Series
. - Now, for each group, we want to get only the values that belong to that particular group (using
Series.isin
), and to this selection we applySeries.value_counts
withnormalize
parameter set toTrue
. - Next, we chain
Series.reindex
to re-add any values from the group that are not present in the selection (i.e.z
fromgroup_1
); we supply the value0
for any such missing values by usingSeries.fillna
. - We use a list comprehension to apply this logic to both groups and use the result as input for
pd.concat
to get back oneSeries
. - Finally, we turn the
Series
into adf
(Series.to_frame
), transpose it (df.T
), and add a prefix to the columns (df.add_prefix
).
CodePudding user response:
You can use boolean masks for this. For example:
df.isin(group_1)
Will give you a boolean mask where the values of the DataFrame that are in the list group_1
are True and the rest are False.
Then counting the number of values that are true is as simple as using the method .sum()
twice or .values.sum()
(ie. convert to numpy array and then sum the true values).
On the same way, you can iterate through the elements of the group and use a boolean mask to filter individual values eg.
for i in group_1:
print(f"Number of {i}= {(df==i).values.sum()}")
For the solution I would put the groups in a dictionary and then iterate through the dictionary:
groups = {'group1': group_1, 'group2': group_2 }
results = {}
for group_name, group in groups.items():
count_group = df.isin(group).values.sum()
results[group_name] = {}
for val in group:
count_value = (df==val).values.sum()
results[group_name][val] = count_value/count_group
print(results)
CodePudding user response:
Here is a proposition with melt
& value_counts
:
g1 = df.isin(group_1).sum().sum()
g2 = df.isin(group_2).sum().sum()
d = {**{k:g1 for k in group_1}, **{k:g2 for k in group_2}}
out = (
df
.set_index("id").melt()
["value"].value_counts().to_frame()
.join(pd.Series(d).rename("c"), how="outer").fillna(0)
.assign(temp_col= lambda x: x.pop("value").div(x.pop("c")))
.T.sort_index(axis=1).add_prefix("%_").reset_index(drop=True)
)
Output :
print(out)
%_a %_b %_c %_x %_y %_z
0 0.333333 0.416667 0.25 0.666667 0.333333 0.0
CodePudding user response:
In case, elements of group_1
and group_2
can not coexist in a same column, a possible solution is, which is based on numpy
:
def get_percs(df, group, cols):
size = df.iloc[:,cols].values.size
return np.sum(
np.sum(df.iloc[:,cols].values ==
np.array(group)[:,None][:,None], axis=2),
axis=1) / size
get_percs(df, group_1, range(1,5)), get_percs(df, group_2, range(5,6))
Output:
(array([0.33333333, 0.41666667, 0.25 ]),
array([0.66666667, 0.33333333, 0. ]))
If the output is required to be a dataframe:
pd.concat([pd.DataFrame(get_percs(df, g[0], g[1]).reshape(1,-1),
columns=[f'%_{x}' for x in g[0]])
for g in zip([group_1, group_2], [range(1,5), range(5,6)])], axis=1)
Output:
%_a %_b %_c %_x %_y %_z
0 0.333333 0.416667 0.25 0.666667 0.333333 0.0