I have a data frame like this:
id | w1 | w2 | w3 | w4 | w5 | w6 |
---|---|---|---|---|---|---|
11 | light | light | light | light | light | light |
22 | light | light | light | light | medium | medium |
33 | light | light | medium | medium | medium | heavy |
44 | light | light | medium | NA | NA | NA |
55 | light | light | medium | medium | NA | NA |
66 | medium | medium | medium | NA | NA | NA |
I would like to get the frequency count of light, medium and heavy for each id across w1-w6, and I would also like to get the mode of w1-w6 as a new column.
The target df should look like this:
id | w1 | w2 | w3 | w4 | w5 | w6 | N_light | N_medium | N_heavy | final |
---|---|---|---|---|---|---|---|---|---|---|
11 | light | light | light | light | light | light | 6 | 0 | 0 | light |
22 | light | light | light | light | medium | medium | 4 | 2 | 0 | light |
33 | light | light | medium | medium | medium | heavy | 2 | 3 | 1 | medium |
44 | light | light | medium | NA | NA | NA | 2 | 1 | 0 | light |
55 | light | light | medium | medium | NA | NA | 2 | 2 | 0 | light |
66 | medium | medium | medium | NA | NA | NA | 0 | 3 | 0 | medium |
The real data frame has millions of rows, and I'm struggling to find an efficient way to do this. Any ideas?
I tried the Mode function from the DescTools package, which worked on a limited number of rows in a for loop, but it is far too slow on the full data.
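For reference, the example table above as a reproducible data frame:

```r
# the example data from the question as a data frame
df <- data.frame(
  id = c(11, 22, 33, 44, 55, 66),
  w1 = c("light", "light", "light", "light", "light", "medium"),
  w2 = c("light", "light", "light", "light", "light", "medium"),
  w3 = c("light", "light", "medium", "medium", "medium", "medium"),
  w4 = c("light", "light", "medium", NA, "medium", NA),
  w5 = c("light", "medium", "medium", NA, NA, NA),
  w6 = c("light", "medium", "heavy", NA, NA, NA)
)
```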
CodePudding user response:
I know this asks for dplyr, but in case others find base R useful, you can simply index and use the *apply functions:
# distinct non-NA values across w1-w6
xx <- unique(unlist(df[-1]))
xx <- xx[!is.na(xx)]
# or simply: xx <- c("light", "medium", "heavy")

# one count column per value
newnames <- paste0("N_", xx)
df[newnames] <- sapply(xx, function(x) rowSums(df[, -1] == x, na.rm = TRUE))

# mode: which.max() returns the first maximum, so ties go to
# whichever value appears first in xx
df["final"] <- xx[apply(df[newnames], 1, which.max)]
Output:
id w1 w2 w3 w4 w5 w6 N_light N_medium N_heavy final
1 11 light light light light light light 6 0 0 light
2 22 light light light light medium medium 4 2 0 light
3 33 light light medium medium medium heavy 2 3 1 medium
4 44 light light medium <NA> <NA> <NA> 2 1 0 light
5 55 light light medium medium <NA> <NA> 2 2 0 light
6 66 medium medium medium <NA> <NA> <NA> 0 3 0 medium
CodePudding user response:
Here's a tidyverse solution:
library(dplyr)
library(tidyr)

df %>%
  # cast all columns except `id` longer:
  pivot_longer(-id) %>%
  # for each combination of `id` and `value`...
  group_by(id, value) %>%
  # ... count the frequencies of distinct values:
  summarise(N = ifelse(is.na(value), NA, n())) %>%
  # omit rows with `NA`:
  na.omit() %>%
  # remove duplicated rows:
  slice_head() %>%
  # for each `id`...
  group_by(id) %>%
  # ... cast back wider:
  pivot_wider(names_from = value, values_from = N,
              names_prefix = "N_") %>%
  # replace `NA` with 0:
  mutate(across(starts_with("N"), ~replace_na(., 0))) %>%
  # bind result back to original `df`:
  bind_cols(df %>% select(-id), .) %>%
  # reorder columns:
  select(id, everything())
id w1 w2 w3 w4 N_light N_medium N_heavy
1 1 light light light light 4 0 0
2 2 light light light light 4 0 0
3 3 light light medium medium 2 2 0
4 4 light light <NA> medium 2 1 0
5 5 light light medium medium 2 2 0
6 6 medium medium <NA> heavy 0 2 1
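Note that this pipeline stops at the count columns; to also get the requested mode column, one option (a sketch, where res is an assumed name for the pipeline's result) is to pick the column with the row-wise maximum afterwards:

```r
# `res` stands in for the result of the pipeline above (assumed name);
# a small stand-in data frame for illustration:
res <- data.frame(id = c(11, 22), N_light = c(6, 4),
                  N_medium = c(0, 2), N_heavy = c(0, 0))

# find the count columns and strip the "N_" prefix from the winner;
# ties.method = "first" breaks ties deterministically
n_cols <- grep("^N_", names(res), value = TRUE)
res$final <- sub("^N_", "", n_cols[max.col(res[n_cols], ties.method = "first")])
res$final
#> [1] "light" "light"
```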
Data:
df <- structure(list(id = 1:6, w1 = c("light", "light", "light", "light",
"light", "medium"), w2 = c("light", "light", "light", "light",
"light", "medium"), w3 = c("light", "light", "medium", NA, "medium",
NA), w4 = c("light", "light", "medium", "medium", "medium", "heavy"
)), class = "data.frame", row.names = c(NA, -6L))
CodePudding user response:
In base R (with dat being the original data frame) you could do:
a <- table(cbind(dat[1], stack(dat, -1))[1:2])
# ties.method = "first" makes the result deterministic
# (the default, "random", breaks ties randomly)
cbind(dat, as.data.frame.matrix(a), final = colnames(a)[max.col(a, ties.method = "first")])
   id     w1     w2     w3     w4     w5     w6 heavy light medium  final
11 11  light  light  light  light  light  light     0     6      0  light
22 22  light  light  light  light medium medium     0     4      2  light
33 33  light  light medium medium medium  heavy     1     2      3 medium
44 44  light  light medium   <NA>   <NA>   <NA>     0     2      1  light
55 55  light  light medium medium   <NA>   <NA>     0     2      2  light
66 66 medium medium medium   <NA>   <NA>   <NA>     0     0      3 medium