Select most repeated item per row in R data.table-CodePudding

Good evening.

Currently I have a R data.table with the following structure

name_match1 | name_match2 | name_match3 | real_value
COCA COLA     CACAO         COCA COLA     COCA.COLA
PEPSI         PEPSI         PEPSI         PEPCSI
MONSTER       MORE          MIRINDA       MSTER

in which each name_match performs a fuzzy match with the real_value column. However there are some instances in which the fuzzy match doesn't perform a good job or the real_value might be impossible/hard to detect (the real_value column has value that a user inserts. These input in some cases is not inserted correctly, hence making it hard to even know what it refers to like the MSTER case).

What I would like to do is select per row the record that repeats the most, and if there's none, then don't select any like the following example:

name_match1 | name_match2 | name_match3 | real_value | most_repeated_value | Times_repeated
COCA COLA     CACAO         COCA COLA     COCA.COLA    COCA COLA                  2
PEPSI         PEPSI         PEPSI         PEPCSI       PEPSI                      3
MONSTER       MORE          MIRINDA       MSTER        NULL                       0

However I don't know how to perform this in R data.table. I was wondering if someone knew a way to perform this?

CodePudding user response：

We can adapt the top answer at the How to find the statistical mode FAQ to return NA if there is no unique mode:

Mode <- function(x) {
  ux <- unique(x)
  tab = tabulate(match(x, ux))
  if(length(which(max(tab) == tab)) > 1) return(NA)
  return(ux[which.max(tab)])
}

Then we need to apply this function to the match columns every row.

library(data.table)
dat = fread(text = 'name_match1  name_match2  name_match3  real_value
"COCA COLA"     CACAO         "COCA COLA"     COCA.COLA
PEPSI         PEPSI         PEPSI         PEPCSI
MONSTER       MORE          MIRINDA       MSTER')

dat[, result := apply(dat[, 1:3], 1, Mode)]
dat
#    name_match1 name_match2 name_match3 real_value    result
# 1:   COCA COLA       CACAO   COCA COLA  COCA.COLA COCA COLA
# 2:       PEPSI       PEPSI       PEPSI     PEPCSI     PEPSI
# 3:     MONSTER        MORE     MIRINDA      MSTER      <NA>