I have in R the following data frame:
ID = c(rep(1,5),rep(2,3),rep(3,2),rep(4,6));ID
VAR = c("A","A","A","A","B","C","C","D",
"E","E","F","A","B","F","C","F");VAR
CATEGORY = c("ANE","ANE","ANA","ANB","ANE","BOO","BOA","BOO",
"CAT","CAT","DOG","ANE","ANE","DOG","FUT","DOG");CATEGORY
DATA = data.frame(ID,VAR,CATEGORY);DATA
That looks like this table below :
ID | VAR | CATEGORY |
---|---|---|
1 | A | ANE |
1 | A | ANE |
1 | A | ANA |
1 | A | ANB |
1 | B | ANE |
2 | C | BOO |
2 | C | BOA |
2 | D | BOO |
3 | E | CAT |
3 | E | CAT |
4 | F | DOG |
4 | A | ANE |
4 | B | ANE |
4 | F | DOG |
4 | C | FUT |
4 | F | DOG |
ideal output given the above data frame in R I want to be like that:
ID | TEXTS | category |
---|---|---|
1 | A | ANE |
2 | C | BOO |
3 | E | CAT |
4 | F | DOG |
More specifically: I want for ID say 1 to search the most common value in the column VAR which is A and then to search the most common value in the column CATEGORY related to the most common value A which is the ANE and so forth.
How can I do it in R ? Imagine that it is sample example.My real data frame contains 850.000 rows and has 14000 unique ID.
CodePudding user response:
Another dplyr
strategy using count
and slice
:
library(dplyr)
DATA %>%
group_by(ID) %>%
count(VAR, CATEGORY) %>%
slice(which.max(n)) %>%
select(-n)
ID VAR CATEGORY
<dbl> <chr> <chr>
1 1 A ANE
2 2 C BOA
3 3 E CAT
4 4 F DOG
CodePudding user response:
dplyr
library(dplyr)
DATA %>%
group_by(ID) %>%
filter(VAR == names(sort(table(VAR), decreasing=TRUE))[1]) %>%
group_by(ID, VAR) %>%
summarize(CATEGORY = names(sort(table(CATEGORY), decreasing=TRUE))[1]) %>%
ungroup()
# # A tibble: 4 x 3
# ID VAR CATEGORY
# <dbl> <chr> <chr>
# 1 1 A ANE
# 2 2 C BOA
# 3 3 E CAT
# 4 4 F DOG
Data
DATA <- structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4), VAR = c("A", "A", "A", "A", "B", "C", "C", "D", "E", "E", "F", "A", "B", "F", "C", "F"), CATEGORY = c("ANE", "ANE", "ANA", "ANB", "ANE", "BOO", "BOA", "BOO", "CAT", "CAT", "DOG", "ANE", "ANE", "DOG", "FUT", "DOG")), class = "data.frame", row.names = c(NA, -16L))
CodePudding user response:
We could modify the Mode
to return the index and use that in slice
after grouping by 'ID'
Modeind <- function(x) {
ux <- unique(x)
which.max(tabulate(match(x, ux)))
}
library(dplyr)
DATA %>%
group_by(ID) %>%
slice(Modeind(VAR)) %>%
ungroup
-output
# A tibble: 4 x 3
ID VAR CATEGORY
<dbl> <chr> <chr>
1 1 A ANE
2 2 C BOO
3 3 E CAT
4 4 F DOG
CodePudding user response:
A base R option with nested subset
ave
subset(
subset(
DATA,
!!ave(ave(ID, ID, VAR, FUN = length), ID, FUN = function(x) x == max(x))
),
!!ave(ave(ID, ID, VAR, CATEGORY, FUN = length), ID, VAR, FUN = function(x) seq_along(x) == which.max(x))
)
gives
ID VAR CATEGORY
1 1 A ANE
6 2 C BOO
9 3 E CAT
11 4 F DOG
Explanation
- The inner
subset
ave
is to filter out the rows with the most commonVAR
values (grouped byID
) - Based on the trimmed data frame the previous step, the outer
subset
ave
is to filter out the rows with the most commonCATEGORY
values ( grouped byID
VAR
)