Home > Enterprise >  How to update a column in a data.frame based on the most frequent entry in R
How to update a column in a data.frame based on the most frequent entry in R

Time:09-30

mydat <- data.frame(name = c("James", "James", "James", "James", "Leila", "Leila", "Cici", "Bob", "Bob", "Holly", "Topaz", "Topaz"),
                    code = c(123, 928, 981, 333, 981, 928, 463, 123, 928, 981, 333, 444))
> mydat
    name code
1  James  123
2  James  928
3  James  981
4  James  333
5  Leila  981
6  Leila  928
7   Cici  463
8    Bob  123
9    Bob  928
10 Holly  981
11 Topaz  333
12 Topaz  444

In mydat, each person can have multiple codes associated with it. I want to tabulate the code column and assign the person the code that is most common.

> rbind(table(mydat), Total = colSums(table(mydat)))
      123 333 444 463 928 981
Bob     1   0   0   0   1   0
Cici    0   0   0   1   0   0
Holly   0   0   0   0   0   1
James   1   1   0   0   1   1
Leila   0   0   0   0   1   1
Topaz   0   1   1   0   0   0
Total   2   2   1   1   3   3

Bob has codes 123 and 928. Since 928 appears more often across mydat than 123, Bob will be assigned a code of 928.

Cici has code 463, so Cici will be assigned a code of 463.

Holly has code 981, so Holly will be assigned a code of 981.

James has codes 123, 333, 928, and 981. Since 928 and 981 both appear more frequently than 123 and 333, James will be assigned codes 928 and 981.

The final output should be:

> final_mydat
   name final_code
1   Bob        928
2  Cici        463
3 Holly        981
4 James        928
5 James        981
6 Leila        928
7 Leila        981
8 Topaz        333

Is there a quick way to do this in R?

CodePudding user response:

library(dplyr)
mydat %>%
  add_count(code) %>%
  group_by(name) %>%
  slice_max(n) %>%
  ungroup()
# # A tibble: 8 × 3
#   name   code     n
#   <chr> <dbl> <int>
# 1 Bob     928     3
# 2 Cici    463     1
# 3 Holly   981     3
# 4 James   928     3
# 5 James   981     3
# 6 Leila   981     3
# 7 Leila   928     3
# 8 Topaz   333     2

Add ... %>% select(-n) if you want to get rid of the n column.

CodePudding user response:

In base R we can do

by(transform(mydat, score=ave(code, code, FUN=length)), mydat$name, \(x) {
  with(x, x[score == max(score), ])}) |> c(make.row.names=FALSE) |> do.call(what=rbind)
#    name code score
# 1   Bob  928     3
# 2  Cici  463     1
# 3 Holly  981     3
# 4 James  928     3
# 5 James  981     3
# 6 Leila  981     3
# 7 Leila  928     3
# 8 Topaz  333     2
  • Related