Home > Back-end >  How to choose the most common value in a group related to other group in R?
How to choose the most common value in a group related to other group in R?

Time:09-17

I have in R the following data frame:

ID = c(rep(1,5),rep(2,3),rep(3,2),rep(4,6));ID
VAR = c("A","A","A","A","B","C","C","D",
             "E","E","F","A","B","F","C","F");VAR
CATEGORY = c("ANE","ANE","ANA","ANB","ANE","BOO","BOA","BOO",
        "CAT","CAT","DOG","ANE","ANE","DOG","FUT","DOG");CATEGORY

DATA = data.frame(ID,VAR,CATEGORY);DATA

That looks like this table below :

ID VAR CATEGORY
1 A ANE
1 A ANE
1 A ANA
1 A ANB
1 B ANE
2 C BOO
2 C BOA
2 D BOO
3 E CAT
3 E CAT
4 F DOG
4 A ANE
4 B ANE
4 F DOG
4 C FUT
4 F DOG

ideal output given the above data frame in R I want to be like that:

ID TEXTS category
1 A ANE
2 C BOO
3 E CAT
4 F DOG

More specifically: I want for ID say 1 to search the most common value in the column VAR which is A and then to search the most common value in the column CATEGORY related to the most common value A which is the ANE and so forth.

How can I do it in R ? Imagine that it is sample example.My real data frame contains 850.000 rows and has 14000 unique ID.

CodePudding user response:

Another dplyr strategy using count and slice:

library(dplyr)
DATA %>% 
    group_by(ID) %>% 
    count(VAR, CATEGORY) %>% 
    slice(which.max(n)) %>% 
    select(-n)
     ID VAR   CATEGORY
  <dbl> <chr> <chr>   
1     1 A     ANE     
2     2 C     BOA     
3     3 E     CAT     
4     4 F     DOG  

CodePudding user response:

dplyr

library(dplyr)
DATA %>%
  group_by(ID) %>%
  filter(VAR == names(sort(table(VAR), decreasing=TRUE))[1]) %>%
  group_by(ID, VAR) %>%
  summarize(CATEGORY = names(sort(table(CATEGORY), decreasing=TRUE))[1]) %>%
  ungroup()
# # A tibble: 4 x 3
#      ID VAR   CATEGORY
#   <dbl> <chr> <chr>   
# 1     1 A     ANE     
# 2     2 C     BOA     
# 3     3 E     CAT     
# 4     4 F     DOG     

Data

DATA <- structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4), VAR = c("A", "A", "A", "A", "B", "C", "C", "D", "E", "E", "F", "A", "B", "F", "C", "F"), CATEGORY = c("ANE", "ANE", "ANA", "ANB", "ANE", "BOO", "BOA", "BOO", "CAT", "CAT", "DOG", "ANE", "ANE", "DOG", "FUT", "DOG")), class = "data.frame", row.names = c(NA, -16L))

CodePudding user response:

We could modify the Mode to return the index and use that in slice after grouping by 'ID'

Modeind <- function(x) {
   ux <- unique(x)
   which.max(tabulate(match(x, ux)))
   }
library(dplyr)
DATA %>%
     group_by(ID) %>%
     slice(Modeind(VAR)) %>%
     ungroup

-output

# A tibble: 4 x 3
     ID VAR   CATEGORY
  <dbl> <chr> <chr>   
1     1 A     ANE     
2     2 C     BOO     
3     3 E     CAT     
4     4 F     DOG     

CodePudding user response:

A base R option with nested subset ave

subset(
  subset(
    DATA,
    !!ave(ave(ID, ID, VAR, FUN = length), ID, FUN = function(x) x == max(x))
  ),
  !!ave(ave(ID, ID, VAR, CATEGORY, FUN = length), ID, VAR, FUN = function(x) seq_along(x) == which.max(x))
)

gives

   ID VAR CATEGORY
1   1   A      ANE
6   2   C      BOO
9   3   E      CAT
11  4   F      DOG

Explanation

  1. The inner subset ave is to filter out the rows with the most common VAR values (grouped by ID)
  2. Based on the trimmed data frame the previous step, the outer subset ave is to filter out the rows with the most common CATEGORY values ( grouped by ID VAR)
  • Related