Home > Software design >  Select most repeated item per row in R data.table
Select most repeated item per row in R data.table

Time:12-11

Good evening.

Currently I have a R data.table with the following structure

name_match1 | name_match2 | name_match3 | real_value
COCA COLA     CACAO         COCA COLA     COCA.COLA
PEPSI         PEPSI         PEPSI         PEPCSI
MONSTER       MORE          MIRINDA       MSTER

in which each name_match performs a fuzzy match with the real_value column. However there are some instances in which the fuzzy match doesn't perform a good job or the real_value might be impossible/hard to detect (the real_value column has value that a user inserts. These input in some cases is not inserted correctly, hence making it hard to even know what it refers to like the MSTER case).

What I would like to do is select per row the record that repeats the most, and if there's none, then don't select any like the following example:

name_match1 | name_match2 | name_match3 | real_value | most_repeated_value | Times_repeated
COCA COLA     CACAO         COCA COLA     COCA.COLA    COCA COLA                  2
PEPSI         PEPSI         PEPSI         PEPCSI       PEPSI                      3
MONSTER       MORE          MIRINDA       MSTER        NULL                       0

However I don't know how to perform this in R data.table. I was wondering if someone knew a way to perform this?

CodePudding user response:

We can adapt the top answer at the How to find the statistical mode FAQ to return NA if there is no unique mode:

Mode <- function(x) {
  ux <- unique(x)
  tab = tabulate(match(x, ux))
  if(length(which(max(tab) == tab)) > 1) return(NA)
  return(ux[which.max(tab)])
}

Then we need to apply this function to the match columns every row.

library(data.table)
dat = fread(text = 'name_match1  name_match2  name_match3  real_value
"COCA COLA"     CACAO         "COCA COLA"     COCA.COLA
PEPSI         PEPSI         PEPSI         PEPCSI
MONSTER       MORE          MIRINDA       MSTER')

dat[, result := apply(dat[, 1:3], 1, Mode)]
dat
#    name_match1 name_match2 name_match3 real_value    result
# 1:   COCA COLA       CACAO   COCA COLA  COCA.COLA COCA COLA
# 2:       PEPSI       PEPSI       PEPSI     PEPCSI     PEPSI
# 3:     MONSTER        MORE     MIRINDA      MSTER      <NA>
  • Related