I have a data frame like this:
id | w1 | w2 | w3 | w4 | w5 | w6 |
---|---|---|---|---|---|---|
11 | light | light | light | light | light | light |
22 | light | light | light | light | medium | medium |
33 | light | light | medium | medium | medium | heavy |
44 | light | light | medium | NA | NA | NA |
55 | light | light | medium | medium | NA | NA |
66 | medium | medium | medium | NA | NA | NA |
I would like to get the frequency count of light, medium and heavy for each id across w1-w6, and I would also like to get the mode of w1-w6 as a new column.
The target df should look like this:
id | w1 | w2 | w3 | w4 | w5 | w6 | N_light | N_medium | N_heavy | final |
---|---|---|---|---|---|---|---|---|---|---|
11 | light | light | light | light | light | light | 6 | 0 | 0 | light |
22 | light | light | light | light | medium | medium | 4 | 2 | 0 | light |
33 | light | light | medium | medium | medium | heavy | 2 | 3 | 1 | medium |
44 | light | light | medium | NA | NA | NA | 2 | 1 | 0 | light |
55 | light | light | medium | medium | NA | NA | 2 | 2 | 0 | light |
66 | medium | medium | medium | NA | NA | NA | 0 | 3 | 0 | medium |
The real data frame has millions of rows, and I'm struggling to find an efficient way to do this. Any ideas?
I tried the Mode function from the DescTools package, which worked on a limited number of rows in a for loop, but it is far too slow on the full data.
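For reference, the example table above as a reproducible data frame:

```r
# the example data from the question as a data frame
df <- data.frame(
  id = c(11, 22, 33, 44, 55, 66),
  w1 = c("light", "light", "light", "light", "light", "medium"),
  w2 = c("light", "light", "light", "light", "light", "medium"),
  w3 = c("light", "light", "medium", "medium", "medium", "medium"),
  w4 = c("light", "light", "medium", NA, "medium", NA),
  w5 = c("light", "medium", "medium", NA, NA, NA),
  w6 = c("light", "medium", "heavy", NA, NA, NA)
)
```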
CodePudding user response:
I know this asks for dplyr, but in case others find base R useful, you can simply index and use the *apply functions:
# distinct non-NA values across w1-w6
xx <- unique(unlist(df[-1]))
xx <- xx[!is.na(xx)]
# or simply: xx <- c("light", "medium", "heavy")

# one count column per value
newnames <- paste0("N_", xx)
df[newnames] <- sapply(xx, function(x) rowSums(df[, -1] == x, na.rm = TRUE))

# mode: which.max() returns the first maximum, so ties go to
# whichever value appears first in xx
df["final"] <- xx[apply(df[newnames], 1, which.max)]
Output:
id w1 w2 w3 w4 w5 w6 N_light N_medium N_heavy final
1 11 light light light light light light 6 0 0 light
2 22 light light light light medium medium 4 2 0 light
3 33 light light medium medium medium heavy 2 3 1 medium
4 44 light light medium <NA> <NA> <NA> 2 1 0 light
5 55 light light medium medium <NA> <NA> 2 2 0 light
6 66 medium medium medium <NA> <NA> <NA> 0 3 0 medium
CodePudding user response:
Here's a tidyverse solution:
library(dplyr)
library(tidyr)

df %>%
  # cast all columns except `id` longer:
  pivot_longer(-id) %>%
  # for each combination of `id` and `value`...
  group_by(id, value) %>%
  # ... count the frequencies of distinct values:
  summarise(N = ifelse(is.na(value), NA, n())) %>%
  # omit rows with `NA`:
  na.omit() %>%
  # remove duplicated rows:
  slice_head() %>%
  # for each `id`...
  group_by(id) %>%
  # ... cast back wider:
  pivot_wider(names_from = value, values_from = N,
              names_prefix = "N_") %>%
  # replace `NA` with 0:
  mutate(across(starts_with("N"), ~replace_na(., 0))) %>%
  # bind result back to original `df`:
  bind_cols(df %>% select(-id), .) %>%
  # reorder columns:
  select(id, everything())
id w1 w2 w3 w4 N_light N_medium N_heavy
1 1 light light light light 4 0 0
2 2 light light light light 4 0 0
3 3 light light medium medium 2 2 0
4 4 light light <NA> medium 2 1 0
5 5 light light medium medium 2 2 0
6 6 medium medium <NA> heavy 0 2 1
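Note that this pipeline stops at the count columns; to also get the requested mode column, one option (a sketch, where res is an assumed name for the pipeline's result) is to pick the column with the row-wise maximum afterwards:

```r
# `res` stands in for the result of the pipeline above (assumed name);
# a small stand-in data frame for illustration:
res <- data.frame(id = c(11, 22), N_light = c(6, 4),
                  N_medium = c(0, 2), N_heavy = c(0, 0))

# find the count columns and strip the "N_" prefix from the winner;
# ties.method = "first" breaks ties deterministically
n_cols <- grep("^N_", names(res), value = TRUE)
res$final <- sub("^N_", "", n_cols[max.col(res[n_cols], ties.method = "first")])
res$final
#> [1] "light" "light"
```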
Data:
df <- structure(list(id = 1:6, w1 = c("light", "light", "light", "light",
"light", "medium"), w2 = c("light", "light", "light", "light",
"light", "medium"), w3 = c("light", "light", "medium", NA, "medium",
NA), w4 = c("light", "light", "medium", "medium", "medium", "heavy"
)), class = "data.frame", row.names = c(NA, -6L))
CodePudding user response:
In base R (with dat being the original data frame) you could do:
a <- table(cbind(dat[1], stack(dat, -1))[1:2])
# ties.method = "first" makes the result deterministic
# (the default, "random", breaks ties randomly)
cbind(dat, as.data.frame.matrix(a), final = colnames(a)[max.col(a, ties.method = "first")])
   id     w1     w2     w3     w4     w5     w6 heavy light medium  final
11 11  light  light  light  light  light  light     0     6      0  light
22 22  light  light  light  light medium medium     0     4      2  light
33 33  light  light medium medium medium  heavy     1     2      3 medium
44 44  light  light medium   <NA>   <NA>   <NA>     0     2      1  light
55 55  light  light medium medium   <NA>   <NA>     0     2      2  light
66 66 medium medium medium   <NA>   <NA>   <NA>     0     0      3 medium