Good evening.
Currently I have a R data.table
with the following structure
name_match1 | name_match2 | name_match3 | real_value
COCA COLA CACAO COCA COLA COCA.COLA
PEPSI PEPSI PEPSI PEPCSI
MONSTER MORE MIRINDA MSTER
in which each name_match
performs a fuzzy match with the real_value
column.
However there are some instances in which the fuzzy match doesn't perform a good job or
the real_value
might be impossible/hard to detect (the real_value
column has value that a user inserts. These input in some cases is not inserted correctly, hence making it hard to even know what it refers to like the MSTER case).
What I would like to do is select per row the record that repeats the most, and if there's none, then don't select any like the following example:
name_match1 | name_match2 | name_match3 | real_value | most_repeated_value | Times_repeated
COCA COLA CACAO COCA COLA COCA.COLA COCA COLA 2
PEPSI PEPSI PEPSI PEPCSI PEPSI 3
MONSTER MORE MIRINDA MSTER NULL 0
However I don't know how to perform this in R data.table
. I was wondering if someone knew a way to perform this?
CodePudding user response:
We can adapt the top answer at the How to find the statistical mode FAQ to return NA
if there is no unique mode:
Mode <- function(x) {
ux <- unique(x)
tab = tabulate(match(x, ux))
if(length(which(max(tab) == tab)) > 1) return(NA)
return(ux[which.max(tab)])
}
Then we need to apply this function to the match columns every row.
library(data.table)
dat = fread(text = 'name_match1 name_match2 name_match3 real_value
"COCA COLA" CACAO "COCA COLA" COCA.COLA
PEPSI PEPSI PEPSI PEPCSI
MONSTER MORE MIRINDA MSTER')
dat[, result := apply(dat[, 1:3], 1, Mode)]
dat
# name_match1 name_match2 name_match3 real_value result
# 1: COCA COLA CACAO COCA COLA COCA.COLA COCA COLA
# 2: PEPSI PEPSI PEPSI PEPCSI PEPSI
# 3: MONSTER MORE MIRINDA MSTER <NA>