So I have data as such
USER source
A orange
A apple
B banana
B banana
B grape
C grape
D orange
D orange
All I want to do is show the source which appeared the most for each USER. BUT in addition, there are instances where there are ties, in which case both winners of the tie should be represented as .50 in a weight variable.
My final result in doing this to the above data should be as follows:
USER source weight
A orange .5
A apple .5
B banana 1
C grape 1
D orange 1
So basically, I only use the weight column because there can be ties that need accounting for..., but I want to have as many 1's as possible.
In R code:
data <- structure(list(USER = c("A", "A", "B", "B", "B", "C", "D", "D"
), source = c("orange", "apple", "banana", "banana", "grape",
"grape", "orange", "orange")), row.names = c(NA, -8L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x000001f74a701ef0>)
CodePudding user response:
Using the Modes
function from here you may solve this with dplyr
.
library(dplyr)
Modes <- function(x) {
ux <- unique(x)
tab <- tabulate(match(x, ux))
ux[tab == max(tab)]
}
data %>%
group_by(USER) %>%
summarise(source = Modes(source)) %>%
mutate(weight = 1/n()) %>%
ungroup
# USER source weight
# <chr> <chr> <dbl>
#1 A orange 0.5
#2 A apple 0.5
#3 B banana 1
#4 C grape 1
#5 D orange 1