How to replace many strings written in different ways with one unified spelling of each term?

I have this data frame:

df = data.frame(x = c('Orange','orange','Appples','orgne','apple','applees','oranges','Oranges',
                      'orgens','orgaanes','Apples','ORANGES','apple','APPLE') )

Using str_replace_all, I know I can replace each of these terms with one unified spelling of the two words, orange and apple, but that would take forever with a lot of terms in the data frame. I would like a simple way to unify all the spellings into orange and apple.

CodePudding user response：

You can use agrep for approximate string matching:

for (i in c("orange", "apple")) {
  # Overwrite every value that approximately matches the canonical term i
  df$x[agrep(i, df$x, max.distance = 2, ignore.case = TRUE)] <- i
}
df$x

#[1] "orange"   "orange"   "apple"    "orange"   "apple"   "apple"    "orange"   "orange"   "orange"   "orgaanes" "apple"    "orange"   "apple"    "apple"

You can make the matching stricter or more permissive with max.distance. With the value of 2 used here, "orgaanes" is too far from "orange" and is left unchanged.

Another possibility is the stringdist package, which has a number of different distance metrics:

library(stringdist)
v <- c("orange", "apple")
v[amatch(tolower(df$x), v, maxDist = 3)]
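
If installing stringdist is not an option, a similar nearest-match lookup can be sketched in base R with adist, which computes Levenshtein edit distances. This is only a rough sketch: unify_terms and max_dist are hypothetical names introduced for illustration, and the threshold of 2 is an assumption you would tune on your real data.

```r
# Sketch only: unify_terms() and max_dist are hypothetical, not part of any package.
unify_terms <- function(x, canon, max_dist = 2) {
  # Distance matrix: one row per element of x, one column per canonical term.
  # ignore.case = TRUE lets "APPLE" match "apple" at distance 0.
  d <- adist(x, canon, ignore.case = TRUE)
  nearest <- apply(d, 1, which.min)         # column index of the closest term
  best <- d[cbind(seq_along(x), nearest)]   # distance to that closest term
  # Replace only when the closest term is near enough; otherwise keep the original.
  ifelse(best <= max_dist, canon[nearest], x)
}

df <- data.frame(x = c('Orange','orange','Appples','orgne','apple','applees','oranges','Oranges',
                       'orgens','orgaanes','Apples','ORANGES','apple','APPLE'),
                 stringsAsFactors = FALSE)
df$y <- unify_terms(df$x, c("orange", "apple"))
```

Values that are further than max_dist from every canonical term are returned unchanged, so raising max_dist catches more misspellings at the risk of merging unrelated words.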

CodePudding user response：

The following does not require knowing the canonical values in advance, so it remains feasible when there are many of them, as stated in the question.

We could assume that two strings are the same term if their first k letters match, ignoring case. Using k = 2 and taking the first value in each group as its representative, we have:

transform(df, y = ave(x, substr(tolower(x), 1, 2), FUN = function(x) x[1]))

giving:

          x       y
1    Orange  Orange
2    orange  Orange
3   Appples Appples
4     orgne  Orange
5     apple Appples
6   applees Appples
7   oranges  Orange
8   Oranges  Orange
9    orgens  Orange
10 orgaanes  Orange
11   Apples Appples
12  ORANGES  Orange
13    apple Appples
14    APPLE Appples
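
Taking the first value of each group can pick a misspelling as the representative (Appples here). One possible variation, shown as a sketch rather than a recommendation, is to use the most frequent lowercased spelling in each group instead. The helper most_common and the column name y are hypothetical names chosen for this illustration.

```r
df <- data.frame(x = c('Orange','orange','Appples','orgne','apple','applees','oranges','Oranges',
                       'orgens','orgaanes','Apples','ORANGES','apple','APPLE'),
                 stringsAsFactors = FALSE)

# Grouping key: the first two letters, ignoring case (same key as in the answer above).
key <- substr(tolower(df$x), 1, 2)

# Hypothetical helper: the most frequent value in v (ties go to the alphabetically first).
most_common <- function(v) names(which.max(table(v)))

# Lowercase first so that "oranges", "Oranges" and "ORANGES" count as one spelling.
df$y <- ave(tolower(df$x), key, FUN = most_common)
```

The most frequent spelling is only as good as the data: if a misspelling happens to dominate a group, it becomes the representative.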

Another possibility is to use the phonics package. It provides a number of phonetic encodings such as soundex and onca. With soundex, the approach below happens to group the values the same way as above. You can play around with the different encodings and their parameters until you get something that works sufficiently well on your real data.

If the canonical values are known, we can replace each x with the value that has the same soundex code, or keep the original x when no value matches.

library(dplyr)
library(phonics)

vals <- c("apples", "oranges")
names(vals) <- soundex(vals, 2)
transform(df, y = coalesce(vals[soundex(x, 2)], x))

giving:

          x       y
1    Orange oranges
2    orange oranges
3   Appples  apples
4     orgne oranges
5     apple  apples
6   applees  apples
7   oranges oranges
8   Oranges oranges
9    orgens oranges
10 orgaanes oranges
11   Apples  apples
12  ORANGES oranges
13    apple  apples
14    APPLE  apples