I have this data frame:
df = data.frame(x = c('Orange','orange','Appples','orgne','apple','applees','oranges','Oranges',
'orgens','orgaanes','Apples','ORANGES','apple','APPLE') )
Using str_replace_all, I know I can replace each of these terms with one unified spelling of the two words orange and apple, but that would take forever with a lot of terms in the data frame. I would like a simple way to code this so that all the spellings are unified into orange and apple.
CodePudding user response:
You can use agrep for approximate string matching:
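for (i in c("orange", "apple")) {
  # Overwrite every entry that approximately matches the target word
  df$x[agrep(i, df$x, max.distance = 2, ignore.case = TRUE)] <- i
}
# Print the result once, after the loop (a bare df$x inside a for loop prints nothing)
df$x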
#[1] "orange" "orange" "apple" "orange" "apple" "apple" "orange" "orange" "orange" "orgaanes" "apple" "orange" "apple" "apple"
You can adjust how tolerant the matching is with max.distance.
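As a rough sketch of how max.distance changes the result, the snippet below rebuilds the question's df and lists which entries agrep matches against "orange" at a few tolerances. The tolerances 1 to 3 and the names tolerances and hits are only illustrative choices, not part of the original answer.

```r
# Rebuild the example data from the question.
df <- data.frame(x = c('Orange','orange','Appples','orgne','apple','applees','oranges','Oranges',
                       'orgens','orgaanes','Apples','ORANGES','apple','APPLE'),
                 stringsAsFactors = FALSE)

# Illustrative tolerances: maximum number of edits allowed for a match.
tolerances <- c(1L, 2L, 3L)

# For each tolerance, collect the entries that approximately match "orange".
hits <- lapply(tolerances, function(d) {
  agrep("orange", df$x, max.distance = d, ignore.case = TRUE, value = TRUE)
})
names(hits) <- paste0("max.distance=", tolerances)

print(hits)
```

Comparing the three lists on your real data is a quick way to see where a tolerance starts pulling in words you did not intend to merge.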
Another possibility is the stringdist package, which has a number of different distance metrics:
library(stringdist)
v <- c("orange", "apple")
v[amatch(tolower(df$x), v, maxDist = 3)]
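If installing a package is not an option, the same nearest-word idea can be sketched in base R with adist, which computes a generalized Levenshtein edit distance. This is a swapped-in alternative to stringdist, not what the answer above uses, and the names targets, d and nearest are just illustrative. Note that it always assigns the closest target, with no cutoff such as maxDist.

```r
# Rebuild the example data from the question.
df <- data.frame(x = c('Orange','orange','Appples','orgne','apple','applees','oranges','Oranges',
                       'orgens','orgaanes','Apples','ORANGES','apple','APPLE'),
                 stringsAsFactors = FALSE)

# Canonical spellings we want to map onto (assumed known in advance).
targets <- c("orange", "apple")

# Edit-distance matrix: one row per entry of df$x, one column per target.
d <- adist(tolower(df$x), targets)

# For every row, pick the target with the smallest distance.
nearest <- targets[apply(d, 1, which.min)]

df$y <- nearest
print(df)
```

Because there is no distance cutoff here, an entry unrelated to both words would still be forced onto one of them; checking the row minimum of d against a threshold first would be one way to guard against that.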
CodePudding user response:
The following does not require knowing the unique values in advance, so it remains feasible when there is a large number of them, as stated in the question.
We could assume that if the first k letters are the same regardless of case then they are the same. Using k=2 and picking out the first of those regarded as equal we have:
transform(df, y = ave(x, substr(tolower(x), 1, 2), FUN = function(x) x[1]))
giving:
          x       y
1    Orange  Orange
2    orange  Orange
3   Appples Appples
4     orgne  Orange
5     apple Appples
6   applees Appples
7   oranges  Orange
8   Oranges  Orange
9    orgens  Orange
10 orgaanes  Orange
11   Apples Appples
12  ORANGES  Orange
13    apple Appples
14    APPLE Appples
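Taking the first element of each group means a misspelling such as Appples can end up as the representative. As a hedged variation on the same grouping, the sketch below picks the most frequent lower-cased spelling in each group instead. The helper most_common and the variables k and key are hypothetical names introduced here for illustration.

```r
# Rebuild the example data from the question.
df <- data.frame(x = c('Orange','orange','Appples','orgne','apple','applees','oranges','Oranges',
                       'orgens','orgaanes','Apples','ORANGES','apple','APPLE'),
                 stringsAsFactors = FALSE)

# Group key: the first k letters, lower-cased (k = 2 as in the answer above).
k <- 2
key <- substr(tolower(df$x), 1, k)

# Hypothetical helper: the most frequent value in a character vector.
# Ties go to the alphabetically first value, because table() sorts its names.
most_common <- function(v) names(which.max(table(v)))

# Replace every entry by the most common lower-cased spelling in its group.
df$y <- ave(tolower(df$x), key, FUN = most_common)
print(df)
```

This still relies on the first k letters being spelled correctly, so it shares the main limitation of the approach above.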
Another possibility is to use the phonics package, which provides a number of phonetic encodings such as soundex and onca. The following happens to give the same grouping as above. You can play around with the different encodings and their parameters until you get something that works sufficiently well on your real data.
If you know the unique values, you can look up each entry by its phonetic code and fall back to the original x when there is no match.
library(dplyr)
library(phonics)
vals <- c("apples", "oranges")
names(vals) <- soundex(vals, 2)
transform(df, y = coalesce(vals[soundex(x, 2)], x))
giving:
          x       y
1    Orange oranges
2    orange oranges
3   Appples  apples
4     orgne oranges
5     apple  apples
6   applees  apples
7   oranges oranges
8   Oranges oranges
9    orgens oranges
10 orgaanes oranges
11   Apples  apples
12  ORANGES oranges
13    apple  apples
14    APPLE  apples