Home > Back-end >  How to determine if a value appears the most in a set of rows using Dplyr?
How to determine if a value appears the most in a set of rows using Dplyr?

Time:10-06

So I have data as such

USER    source
A       orange
A       apple
B       banana
B       banana
B       grape
C       grape
D       orange
D       orange

All I want to do is show the source which appeared the most for each USER. BUT in addition, there are instances where there are ties, in which case both winners of the tie should be represented as .50 in a weight variable.

My final result in doing this to the above data should be as follows:

USER    source   weight
A       orange   .5
A       apple    .5
B       banana   1
C       grape    1
D       orange   1

So basically, I only use the weight column because there can be ties that need accounting for..., but I want to have as many 1's as possible.

In R code:

data <- structure(list(USER = c("A", "A", "B", "B", "B", "C", "D", "D"
), source = c("orange", "apple", "banana", "banana", "grape", 
"grape", "orange", "orange")), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x000001f74a701ef0>)

CodePudding user response:

Using the Modes function from here you may solve this with dplyr.

library(dplyr)

Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

data %>%
  group_by(USER) %>%
  summarise(source = Modes(source)) %>%
  mutate(weight = 1/n()) %>%
  ungroup

#  USER  source weight
#  <chr> <chr>   <dbl>
#1 A     orange    0.5
#2 A     apple     0.5
#3 B     banana    1  
#4 C     grape     1  
#5 D     orange    1  
  • Related