Home > Net >  Compare 2 Text column of 2 Data frame by consecutive letter in R
Compare 2 Text column of 2 Data frame by consecutive letter in R

Time:07-13

I've the following 2 dataset: The original dataset and the matching dataset. The matching dataset is used to match against the original dataset.

Original Dataset
MelbourneAir
MelbourneCityAir
AirMelbourne
LondonAir
londonterminal
Airportlondon
NewyorkAirport
Airnewyork
newyorkterminal
airnorway
terminalnorway
swedenair
airsweden
swedenbus
shopbanana
bananashop
bananaashop
appleshop
shopapple
Original_df = structure(list(df.Original.Data = structure(c(11L, 12L, 1L, 9L, 
10L, 4L, 13L, 2L, 14L, 15L, 3L, 20L, 18L, 5L, 19L, 17L, 8L, 7L, 
6L, 16L), .Label = c("AirMelbourne", "airnewyork", "airnorway", 
"AirportLondon", "airsweden", "appleshop", "bananaashop", "bananashop", 
"LondonAir", "LondonTerminal", "MelbourneAir", "MelbourneCityAir", 
"newyorkairport", "newyorkterminal", "norwayterminal", "shopapple", 
"shopbanana", "swedenair", "swedenbus", "terminalnorway"), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))
Matching Dataset
MelbourneAirport
LondonTerminal
NewYorkAirport
NorwayTerminal
SwedenAirport
BananaShop
AppleShop
Matching_df = structure(list(df.Matching.Data = structure(c(5L, 4L, 6L, 7L, 
8L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("", "AppleShop", "BananaShop", "LondonTerminal", 
"MelbourneAirport", "NewYorkAirport", "NorwayTerminal", "SwedenAirport"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-20L))

head(matching_df,7)

Does anyone know how I can use R and it's packages (stringdist, stringr, stringi, tidyr, dplyr or base R) to do the matching based on consecutive letters (i.e. as long as the text in original dataset matches the text in the matching dataset by 5(or 6) consecutive letters, R will match it accordingly to the following table? I've been using Excel so far to do the matching but given the limit to how much data excel can handle, using R would be more efficient.

Original Dataset Matched Dataset
MelbourneAir MelbourneAirport
MelbourneCityAir MelbourneAirport
AirMelbourne MelbourneAirport
LondonAir LondonTerminal
londonterminal LondonTerminal
Airportlondon LondonTerminal
NewyorkAirport NewYorkAirport
Airnewyork NewYorkAirport
newyorkterminal NewYorkAirport
airnorway NorwayTerminal
terminalnorway NorwayTerminal
swedenair SwedenAirport
airsweden SwedenAirport
swedenbus SwedenAirport
shopbanana BananaShop
bananashop BananaShop
bananaashop BananaShop
appleshop AppleShop
shopapple AppleShop

Thanks in advance!

CodePudding user response:

stringdist indeed is a way to calculate the distance between two strings.

#load stringdist package
library(stringdist)
#calculate "Optimal string alignment" distance between two strings
sapply(1:nrow(original_df), function(x) stringdist(original_df$df.Original.Data[x], matching_df$df.Matching.Data[x], method = "osa"))

distlist <- list()
for(i in 1:nrow(original_df)) {
  distlist[[i]] <- stringdist(tolower(original_df$df.Original.Data[i]), tolower(matching_df$df.Matching.Data), method = "osa")
}
minindexes <- sapply(distlist, which.min)
original_df$result <- Matching_df$df.Matching.Data[minindexes]


> original_df
   df.Original.Data           result
1      MelbourneAir MelbourneAirport
2  MelbourneCityAir MelbourneAirport
3      AirMelbourne        AppleShop
4         LondonAir   LondonTerminal
5    LondonTerminal   LondonTerminal
6     AirportLondon        AppleShop
7    newyorkairport   NewYorkAirport
8        airnewyork        AppleShop
9   newyorkterminal   LondonTerminal
10   norwayterminal   NorwayTerminal
11        airnorway       BananaShop
12   terminalnorway    SwedenAirport
13        swedenair    SwedenAirport
14        airsweden        AppleShop
15        swedenbus    SwedenAirport
16       shopbanana       BananaShop
17       bananashop       BananaShop
18      bananaashop       BananaShop
19        appleshop        AppleShop
20        shopapple        AppleShop

Several methods are available, see stringdist-metrics

enter image description here

  • Related