I would like to get the total sum of value occurrences in col Z
for every W,X,Y
group and only keep the top 3. The sum of occurrences of all other rows in W,X,Y
group should be grouped under "Other"
I was able to get the sum for each value in a new column COUNT
, but not sure how to limit this to top 3, and how to group all others under "Other". Any help would be much appreciated...
data_grouped = data.groupby(["W", "X", "Y"])
for group_name, group in data_grouped:
res = group.groupby(["Z"]).size().reset_index(name="COUNT")
More processing stuff and store in db...
INPUT
| W | X | Y | Z |
| - | - | - | - |
| a | d | x | |
| b | d | f | h |
| b | d | f | h |
| a | d | f | |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | j |
| b | d | f | j |
| b | d | f | j |
| b | d | f | k |
| b | d | f | k |
| b | d | f | l |
| b | d | f | l |
| b | d | f | m |
| b | d | f | m |
| b | d | f | n |
| b | d | f | |
| b | d | f | |
| b | d | f | |
| a | d | f | |
| a | d | f | |
| c | e | g | |
| c | e | g | |
| c | e | g | |
EXPECTED OUTPUT
| Z | W | X | Y | COUNT |
| ----- | - | - | - | ----- |
| h | b | d | f | 12 |
| i | b | d | f | 7 |
| j | b | d | f | 3 |
| Other | b | d | f | 7 | <-- sum of k,l,m,n
and so on...
CodePudding user response:
Here is one way of completing this:
data_grouped = df.groupby(["W", "X", "Y"])
grand_output = pd.DataFrame(columns = ["Z", "W", "X", "Y", "COUNT"])
for group_name, group in data_grouped:
# output dataframe for group
output = pd.DataFrame(columns=[])
res = group.groupby(["Z"]).size().reset_index(name="COUNT")
# create dataframe of res and W, X, Y columns
output = pd.concat([pd.DataFrame([list(group_name)]*len(res), columns=["W", "X", "Y"]), res], axis=1, ignore_index=True)
output.columns = ["W", "X", "Y", "Z", "COUNT"]
# sort and sum
output.sort_values(["COUNT", "Z"], ascending=False, inplace=True)
if len(output) > 3:
others = output.iloc[3:]["COUNT"].sum()
output = pd.concat([output.iloc[:3], pd.DataFrame([list(group_name) ["other", others]], columns=["W", "X", "Y", "Z", "COUNT"])])
# append to final output
grand_output = grand_output.append(output)
grand_output
#Out:
# Z W X Y COUNT
#0 a d f 3
#0 a d x 1
#1 h b d f 12
#2 i b d f 7
#3 j b d f 3
#0 other b d f 10
#0 c e g 3
CodePudding user response:
You can use value_counts
to find the counts; then groupby.head
to get the top 3. Then filter out the top 3 values and use groupby.sum
to get the total of OTHER
. Finally, append
this back to top3
:
counts = df.value_counts(['W','X','Y','Z'])
top3 = counts.groupby(level=[0,1,2]).head(3)
out = (top3.append(counts[~counts.isin(top3)].reset_index(level='Z')
.assign(Z='Other').set_index('Z', append=True).squeeze()
.groupby(level=[0,1,2,3]).sum()).reset_index(name='COUNT'))
Output:
W X Y Z COUNT
0 b d f h 12
1 b d f i 7
2 b d f j 3
3 b d f Other 7