I've the following 2 dataset: The original dataset and the matching dataset. The matching dataset is used to match against the original dataset.
Original Dataset |
---|
MelbourneAir |
MelbourneCityAir |
AirMelbourne |
LondonAir |
londonterminal |
Airportlondon |
NewyorkAirport |
Airnewyork |
newyorkterminal |
airnorway |
terminalnorway |
swedenair |
airsweden |
swedenbus |
shopbanana |
bananashop |
bananaashop |
appleshop |
shopapple |
Original_df = structure(list(df.Original.Data = structure(c(11L, 12L, 1L, 9L,
10L, 4L, 13L, 2L, 14L, 15L, 3L, 20L, 18L, 5L, 19L, 17L, 8L, 7L,
6L, 16L), .Label = c("AirMelbourne", "airnewyork", "airnorway",
"AirportLondon", "airsweden", "appleshop", "bananaashop", "bananashop",
"LondonAir", "LondonTerminal", "MelbourneAir", "MelbourneCityAir",
"newyorkairport", "newyorkterminal", "norwayterminal", "shopapple",
"shopbanana", "swedenair", "swedenbus", "terminalnorway"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
Matching Dataset |
---|
MelbourneAirport |
LondonTerminal |
NewYorkAirport |
NorwayTerminal |
SwedenAirport |
BananaShop |
AppleShop |
Matching_df = structure(list(df.Matching.Data = structure(c(5L, 4L, 6L, 7L,
8L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("", "AppleShop", "BananaShop", "LondonTerminal",
"MelbourneAirport", "NewYorkAirport", "NorwayTerminal", "SwedenAirport"
), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
head(matching_df,7)
Does anyone know how I can use R and it's packages (stringdist, stringr, stringi, tidyr, dplyr or base R) to do the matching based on consecutive letters (i.e. as long as the text in original dataset matches the text in the matching dataset by 5(or 6) consecutive letters, R will match it accordingly to the following table? I've been using Excel so far to do the matching but given the limit to how much data excel can handle, using R would be more efficient.
Original Dataset | Matched Dataset |
---|---|
MelbourneAir | MelbourneAirport |
MelbourneCityAir | MelbourneAirport |
AirMelbourne | MelbourneAirport |
LondonAir | LondonTerminal |
londonterminal | LondonTerminal |
Airportlondon | LondonTerminal |
NewyorkAirport | NewYorkAirport |
Airnewyork | NewYorkAirport |
newyorkterminal | NewYorkAirport |
airnorway | NorwayTerminal |
terminalnorway | NorwayTerminal |
swedenair | SwedenAirport |
airsweden | SwedenAirport |
swedenbus | SwedenAirport |
shopbanana | BananaShop |
bananashop | BananaShop |
bananaashop | BananaShop |
appleshop | AppleShop |
shopapple | AppleShop |
Thanks in advance!
CodePudding user response:
stringdist
indeed is a way to calculate the distance between two strings.
#load stringdist package
library(stringdist)
#calculate "Optimal string alignment" distance between two strings
sapply(1:nrow(original_df), function(x) stringdist(original_df$df.Original.Data[x], matching_df$df.Matching.Data[x], method = "osa"))
distlist <- list()
for(i in 1:nrow(original_df)) {
distlist[[i]] <- stringdist(tolower(original_df$df.Original.Data[i]), tolower(matching_df$df.Matching.Data), method = "osa")
}
minindexes <- sapply(distlist, which.min)
original_df$result <- Matching_df$df.Matching.Data[minindexes]
> original_df
df.Original.Data result
1 MelbourneAir MelbourneAirport
2 MelbourneCityAir MelbourneAirport
3 AirMelbourne AppleShop
4 LondonAir LondonTerminal
5 LondonTerminal LondonTerminal
6 AirportLondon AppleShop
7 newyorkairport NewYorkAirport
8 airnewyork AppleShop
9 newyorkterminal LondonTerminal
10 norwayterminal NorwayTerminal
11 airnorway BananaShop
12 terminalnorway SwedenAirport
13 swedenair SwedenAirport
14 airsweden AppleShop
15 swedenbus SwedenAirport
16 shopbanana BananaShop
17 bananashop BananaShop
18 bananaashop BananaShop
19 appleshop AppleShop
20 shopapple AppleShop
Several methods are available, see stringdist-metrics