I want to check whether there is a close match between the values in a column and a list of strings. There is rarely a perfect match, so %in% is no good. I'd rather err on the side of caution than miss something, but I'd like to avoid matching patterns inside individual words (so "Ten" should not match "Tenis").
For example
List:
Tenis PLC
Green Company Limited
(DCC) Darth Company Creditors
Dataframe:
ID. Company Name
10. Ten LTD
12. Green Company (GC) LTD
23. MCC
48. DARTH
Return:
False
True
False
True
EDIT: I should mention that I have now cleaned the data a little, making it all lowercase and removing any brackets.
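The cleaning step mentioned in the edit could look roughly like the sketch below. `clean_name` is a hypothetical helper name, and the sketch assumes that "remove any brackets" means dropping the "(" and ")" characters while keeping the text inside them.

```r
# Hypothetical helper: lowercase a name and strip round brackets.
# Assumption: only the "(" and ")" characters go, not the text inside them.
clean_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[()]", "", x)    # drop bracket characters
  x <- gsub("\\s+", " ", x)   # collapse repeated whitespace
  trimws(x)                   # remove leading/trailing spaces
}

cleaned <- clean_name(c("(DCC) Darth Company Creditors", "Green Company (GC) LTD"))
print(cleaned)
```

If the bracketed abbreviations should disappear entirely, the first `gsub` pattern would need to match the bracket contents as well.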
CodePudding user response:
To regenerate your data:
l = list(tolower(c('Tenis PLC',
                   'Green Company Limited',
                   '(DCC) Darth Company Creditors')))
tmp_df = data.frame(Company_Name = tolower(c('Ten LTD', 'Green Company (GC) LTD',
                                             'MCC', 'DARTH')))
Solution:
- Split every list entry on spaces to get the individual words:
split1 = unlist(strsplit(unlist(l), ' '))
- Check whether each value in Company_Name shares at least one whole word with them (assuming this is what you meant):
sapply(tmp_df$Company_Name,
       function(x) any(unlist(strsplit(x, ' ')) %in% split1))
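Putting the two steps together, a minimal self-contained sketch might look like this. The helper name `shares_word` and the `match` column are made up for illustration; the helper returns TRUE when a company name shares at least one whole word with any list entry.

```r
# Reference strings and data frame, lowercased as in the question.
l <- tolower(c("Tenis PLC", "Green Company Limited", "(DCC) Darth Company Creditors"))
tmp_df <- data.frame(
  Company_Name = tolower(c("Ten LTD", "Green Company (GC) LTD", "MCC", "DARTH")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if any whole word of `name` appears in `words`.
shares_word <- function(name, words) {
  any(strsplit(name, " ", fixed = TRUE)[[1]] %in% words)
}

# All individual words that occur anywhere in the reference strings.
split1 <- unlist(strsplit(l, " ", fixed = TRUE))

# One logical value per row of the data frame.
tmp_df$match <- vapply(tmp_df$Company_Name, shares_word, logical(1),
                       words = split1, USE.NAMES = FALSE)
print(tmp_df)
```

Because the comparison is on whole words, "ten" is not matched by "tenis", while "darth" and "green"/"company" are matched.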
EDIT:
To keep only the items in split1 with more than 3 characters (which drops short tokens such as 'plc'):
split1 = split1[nchar(split1) > 3]
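To see what the length filter changes, here is a small sketch that applies it before matching. `match_long_words` and its `min_chars` argument are hypothetical names, and the threshold of 4 is only an assumed starting point to tune against real data.

```r
# Hypothetical helper: whole-word overlap, ignoring reference words shorter
# than `min_chars` characters (an assumed, tunable threshold).
match_long_words <- function(names, reference, min_chars = 4) {
  words <- unlist(strsplit(reference, " ", fixed = TRUE))
  words <- words[nchar(words) >= min_chars]   # drop short tokens like "plc"
  vapply(strsplit(names, " ", fixed = TRUE),
         function(tokens) any(tokens %in% words),
         logical(1))
}

reference  <- c("tenis plc", "green company limited")
candidates <- c("acme plc", "tenis club", "green ltd")

res_filtered   <- match_long_words(candidates, reference)                # short words ignored
res_unfiltered <- match_long_words(candidates, reference, min_chars = 1) # every word counts
print(data.frame(candidates, res_filtered, res_unfiltered))
```

With the filter in place, "acme plc" no longer matches just because both sides contain the generic suffix "plc".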