Total sum of top 3 value occurrences in groups in pandas-CodePudding

I would like to get the total sum of value occurrences in col Z for every W,X,Y group and only keep the top 3. The sum of occurrences of all other rows in W,X,Y group should be grouped under "Other"

I was able to get the sum for each value in a new column COUNT, but not sure how to limit this to top 3, and how to group all others under "Other". Any help would be much appreciated...

data_grouped = data.groupby(["W", "X", "Y"])

for group_name, group in data_grouped: 
  res = group.groupby(["Z"]).size().reset_index(name="COUNT")

  More processing stuff and store in db...

INPUT

| W | X | Y | Z |
| - | - | - | - |
| a | d | x |   |
| b | d | f | h |
| b | d | f | h |
| a | d | f |   |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | h |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | i |
| b | d | f | j |
| b | d | f | j |
| b | d | f | j |
| b | d | f | k |
| b | d | f | k |
| b | d | f | l |
| b | d | f | l |
| b | d | f | m |
| b | d | f | m |
| b | d | f | n |
| b | d | f |   |
| b | d | f |   |
| b | d | f |   |
| a | d | f |   |
| a | d | f |   |
| c | e | g |   |
| c | e | g |   |
| c | e | g |   |

EXPECTED OUTPUT

| Z     | W | X | Y | COUNT |
| ----- | - | - | - | ----- |
| h     | b | d | f |  12   |
| i     | b | d | f |  7    | 
| j     | b | d | f |  3    | 
| Other | b | d | f |  7    |  <-- sum of k,l,m,n
and so on...

CodePudding user response：

Here is one way of completing this:

data_grouped = df.groupby(["W", "X", "Y"])
grand_output = pd.DataFrame(columns = ["Z", "W", "X", "Y", "COUNT"])

for group_name, group in data_grouped: 
    # output dataframe for group
    output = pd.DataFrame(columns=[])
    res = group.groupby(["Z"]).size().reset_index(name="COUNT")
    # create dataframe of res and W, X, Y columns
    output = pd.concat([pd.DataFrame([list(group_name)]*len(res), columns=["W", "X", "Y"]), res], axis=1, ignore_index=True)
    output.columns = ["W", "X", "Y", "Z", "COUNT"]
    # sort and sum
    output.sort_values(["COUNT", "Z"], ascending=False, inplace=True)
    if len(output) > 3:
        others = output.iloc[3:]["COUNT"].sum()
        output = pd.concat([output.iloc[:3], pd.DataFrame([list(group_name) ["other", others]], columns=["W", "X", "Y", "Z", "COUNT"])])
    # append to final output
    grand_output = grand_output.append(output)

grand_output
#Out: 
#       Z  W  X  Y COUNT
#0         a  d  f     3
#0         a  d  x     1
#1      h  b  d  f    12
#2      i  b  d  f     7
#3      j  b  d  f     3
#0  other  b  d  f    10
#0         c  e  g     3

CodePudding user response：

You can use value_counts to find the counts; then groupby.head to get the top 3. Then filter out the top 3 values and use groupby.sum to get the total of OTHER. Finally, append this back to top3:

counts = df.value_counts(['W','X','Y','Z'])
top3 = counts.groupby(level=[0,1,2]).head(3)
out = (top3.append(counts[~counts.isin(top3)].reset_index(level='Z')
                   .assign(Z='Other').set_index('Z', append=True).squeeze()
                   .groupby(level=[0,1,2,3]).sum()).reset_index(name='COUNT'))

Output:

   W  X  Y      Z  COUNT
0  b  d  f      h     12
1  b  d  f      i      7
2  b  d  f      j      3
3  b  d  f  Other      7