Home > Software engineering >  How to rank subgroups in dplyr?
How to rank subgroups in dplyr?

Time:07-24

In the below reproducible code, the user inputs a series of Element and Group codes into vectors. A Group number not = 0 means that Element is to be split into a sub-group and is "paired with" the Elements next to it sharing the same Group code. ElementCnt means the sequential number of times that Element appears in the vector.

I am trying to create a dplyr::mutate(...) statement for GroupRank in the reproducible code that returns the following GroupRank for each row: if Group = 0, then return that row's ElementCnt; if it is the first Element in a sub-group pair (again, sub-group flagged by Group not = 0), then take its ElementCnt - the number of pairings that occurred with a lower ElementCnt than it, as explained below under "Explain GroupRank"; and if it shares the same Group number as the Element immediately above it, then it shares the same GroupRank:

   Element Group ElementCnt GroupRank Explain GroupRank
1        A     0          1         1 GroupRank = ElementCnt since Group = 0
2        A     0          2         2 GroupRank = ElementCnt since Group = 0
3        B     0          1         1 GroupRank = ElementCnt since Group = 0
4        B    11          2         2 Since GroupRank > 0 and this is the first Element in Group 11, assign GroupRank = ElementCnt; no need to reduce ElementCnt since there are no pairings that occur with ElementCnt < 2
5        B    11          3         2 Since Group = Group immediately above, assign same GroupRank
6        B     0          4         4 GroupRank = ElementCnt since Group = 0
7        B     0          5         5 GroupRank = ElementCnt since Group = 0
8        B     0          6         6 GroupRank = ElementCnt since Group = 0
9        C     0          1         1 GroupRank = ElementCnt since Group = 0
10       A     0          3         3 GroupRank = ElementCnt since Group = 0
11       A     2          4         3 GroupRank should be 3 because Group 11 sub-group pair above causes all ElementCnt > 2 to decrease by 1; so ElementCnt of 4 - 1 = 3
12       A     2          5         3 Since Group = Group immediately above, assign same GroupRank
13       B    -3          7         5 GroupRank should be 5 because Group 11 sub-group pair above causes all ElementCnt > 2 to decrease by 1 and Group 2 sub-group pair above causes ElementCnt > 4 to decrease by 1 again; ElementCnt 7 - 1 - 1 = 5
14       B    -3          8         5 Since Group = Group immediately above, assign same GroupRank
15       B    -3          9         5 Since Group = Group immediately above, assign same GroupRank

Reproducible code (GroupRank is not working correctly):

library(dplyr)

myData <-
  data.frame(
    Element = c("A","A","B","B","B","B","B","B","C","A","A","A","B","B","B"),
    Group = c(0,0,0,11,11,0,0,0,0,0,2,2,-3,-3,-3)
  )

excelCopy <- myData %>% 
  group_by(Element) %>% 
  mutate(ElementCnt = row_number()) %>% 
  mutate(GroupRank = case_when(Group > 0 ~ match(Group, unique(Group)),TRUE ~ ElementCnt)) %>%
  ungroup() 

print.data.frame(excelCopy)

Code reflecting Marcus solution - with one modification to generalize to my 12 test scenarios (all modifications from Marcus are commented #):

excelCopy <- myData %>%
  mutate(origOrder = row_number()) %>%  # change to origOrder, clearer for me
  group_by(Element) %>% 
  mutate(
    ElementCnt = row_number()
  ) %>%
  ungroup() %>%  
  mutate(
    Group = factor(Group, unique(Group))
  ) %>% 
  arrange(Group) %>% 
  mutate(
    groupCt = cumsum(Group != lag(Group, 1, Group[[1]])) - 1L
  ) %>%  
  group_by(Group) %>%  
  mutate(
    GroupRank = ElementCnt - groupCt,
    GroupRank = max(1,if_else( as.character(Group) == "0", ElementCnt, min(GroupRank))) # added a floor value of 1 to generalize to all my scenarios
  ) %>%  
  ungroup() %>% 
  arrange(origOrder) %>%  
  select(-groupCt)

CodePudding user response:

I'm not sure how well this generalizes, but it should match the desired output.

Since, Group is independent of Element you need to change the grouping depending on which statistic you are trying to add. ElementCt should be calculated with group_by(Element). GroupRank the data should be group_by(Group). Lastly, because Group is not actually a count (skips from 11 to 2 to -3) but the ordering matters, it should be treated as factor.

excelCopy <- myData %>%
  
  # add to be able to verify against desired output
  mutate(ogOrder = row_number()) %>%  
  
  group_by(Element) %>% 
  mutate(
    ElementCnt = row_number()
  ) %>%
  
  # done with the "Element" statistics
  ungroup() %>%  
  
  # Ordering matters, but can't rely on the Group value for ordering
  mutate(
    Group = factor(Group, unique(Group))
  ) %>% 
  arrange(Group) %>% 
  
  # need to know cumulative sub-group count to use as an offset
  # subtract 1 to 0 offset for first sub-group
  mutate(
    groupCt = cumsum(Group != lag(Group, 1, Group[[1]])) - 1L
  ) %>%  
  
  # perform "Group" statistics
  group_by(Group) %>%  
  mutate(
    # set the first row in each sub group to the correct GroupRank
    GroupRank = ElementCnt - groupCt,
    
    # Correct the non-subgroup entries and non-initial sub-group rows
    GroupRank = if_else( as.character(Group) == "0", ElementCnt, min(GroupRank))
  ) %>%  
  ungroup() %>% 
  
  # add to be able to verify against desired output
  arrange(ogOrder) %>%  
  select(-groupCt)
  • Related