Home > Software design >  How do I merge two tables based on a partially matched column?
How do I merge two tables based on a partially matched column?

Time:09-09

HAVE1 <- data.frame(NAME = c("Amelia", "Amelia", "Amelia", "Amelia", "Ava", "Ava", "Ava", "Charlotte", "Charlotte", "Charlotte", "Charlotte", "Emma", "Emma", "Olivia", "Olivia"), CLASS1 = c("Bee", "Chicken", "Shrimp", "Turkey", "Horse", "Pig", "Sheep", "Deer", "Dove", "Ducks", "Fish", "Crab", "Rabbit", "Cow", "Goat"))

HAVE2 <- data.frame(
  NAME = c("Amelia", "Amelia", "Amelia", "Ava", "Ava", "Ava", "Charlotte", "Charlotte", "Charlotte", "Charlotte", "Emma", "Emma", "Olivia", "Olivia"),
  CLASS2 = c("Chicken", "Some Shrimp", "TurkeyPU", "Horse-Horsey", "Pig-Farn", "Count-Sheep", "Deer Head", "Dove Count", "Ducks Count", "Fishes", "Crab- Count", "Rabbit-Count", "Cow-Type", "Goat-Feed")
)


WANT <- data.frame(
  NAME = c("Amelia", "Amelia", "Amelia", "Ava", "Ava", "Ava", "Charlotte", "Charlotte", "Charlotte", "Charlotte", "Emma", "Emma", "Olivia", "Olivia"),
  CLASS1 = c("Chicken", "Shrimp", "Turkey", "Horse", "Pig", "Sheep", "Deer", "Dove", "Ducks", "Fish", "Crab", "Rabbit", "Cow", "Goat"),
  CLASS2 = c("Chicken", "Some Shrimp", "TurkeyPU", "Horse-Horsey", "Pig-Farn", "Count-Sheep", "Deer Head", "Dove Count", "Ducks Count", "Fishes", "Crab- Count", "Rabbit-Count", "Cow-Type", "Goat-Feed")
)

I have 'HAVE1' and 'HAVE2' and want to make 'WANT' which merge 'HAVE1' and 'HAVE2' using EXACT MATCH on 'NAME' and subMATCH on 'CLASS' as you can see the word in CLASS1 is also in CLASS2 but CLASS2 has more characters and some are the same

CodePudding user response:

  • We can use
library(tidyverse)

HAVE1 |> full_join(HAVE2) |>
         filter(str_detect(CLASS2 , CLASS1))
  • Output
Joining, by = "NAME"
        NAME  CLASS1       CLASS2
1     Amelia Chicken      Chicken
2     Amelia  Shrimp  Some Shrimp
3     Amelia  Turkey     TurkeyPU
4        Ava   Horse Horse-Horsey
5        Ava     Pig     Pig-Farn
6        Ava   Sheep  Count-Sheep
7  Charlotte    Deer    Deer Head
8  Charlotte    Dove   Dove Count
9  Charlotte   Ducks  Ducks Count
10 Charlotte    Fish       Fishes
11      Emma    Crab  Crab- Count
12      Emma  Rabbit Rabbit-Count
13    Olivia     Cow     Cow-Type
14    Olivia    Goat    Goat-Feed

CodePudding user response:

Here is a base R option using merge subset grepl

subset(
  merge(HAVE1, HAVE2, all = TRUE, sort = FALSE),
  Vectorize(grepl)(CLASS1, CLASS2)
)

which gives

        NAME  CLASS1       CLASS2
4     Amelia Chicken      Chicken
8     Amelia  Shrimp  Some Shrimp
12    Amelia  Turkey     TurkeyPU
13       Ava   Horse Horse-Horsey
17       Ava     Pig     Pig-Farn
21       Ava   Sheep  Count-Sheep
22 Charlotte    Deer    Deer Head
27 Charlotte    Dove   Dove Count
32 Charlotte   Ducks  Ducks Count
37 Charlotte    Fish       Fishes
38      Emma    Crab  Crab- Count
41      Emma  Rabbit Rabbit-Count
42    Olivia     Cow     Cow-Type
45    Olivia    Goat    Goat-Feed

breakdown the code

  1. First, we generate a full-join table of HAVE1 and HAVE2
> merge(HAVE1, HAVE2, all = TRUE, sort = FALSE)
        NAME  CLASS1       CLASS2
1     Amelia     Bee      Chicken
2     Amelia     Bee  Some Shrimp
3     Amelia     Bee     TurkeyPU
4     Amelia Chicken      Chicken
5     Amelia Chicken  Some Shrimp
6     Amelia Chicken     TurkeyPU
7     Amelia  Shrimp      Chicken
8     Amelia  Shrimp  Some Shrimp
9     Amelia  Shrimp     TurkeyPU
10    Amelia  Turkey      Chicken
11    Amelia  Turkey  Some Shrimp
12    Amelia  Turkey     TurkeyPU
13       Ava   Horse Horse-Horsey
14       Ava   Horse     Pig-Farn
15       Ava   Horse  Count-Sheep
16       Ava     Pig Horse-Horsey
17       Ava     Pig     Pig-Farn
18       Ava     Pig  Count-Sheep
19       Ava   Sheep Horse-Horsey
20       Ava   Sheep     Pig-Farn
21       Ava   Sheep  Count-Sheep
22 Charlotte    Deer    Deer Head
23 Charlotte    Deer   Dove Count
24 Charlotte    Deer  Ducks Count
25 Charlotte    Deer       Fishes
26 Charlotte    Dove    Deer Head
27 Charlotte    Dove   Dove Count
28 Charlotte    Dove  Ducks Count
29 Charlotte    Dove       Fishes
30 Charlotte   Ducks    Deer Head
31 Charlotte   Ducks   Dove Count
32 Charlotte   Ducks  Ducks Count
33 Charlotte   Ducks       Fishes
34 Charlotte    Fish    Deer Head
35 Charlotte    Fish   Dove Count
36 Charlotte    Fish  Ducks Count
37 Charlotte    Fish       Fishes
38      Emma    Crab  Crab- Count
39      Emma    Crab Rabbit-Count
40      Emma  Rabbit  Crab- Count
41      Emma  Rabbit Rabbit-Count
42    Olivia     Cow     Cow-Type
43    Olivia     Cow    Goat-Feed
44    Olivia    Goat     Cow-Type
45    Olivia    Goat    Goat-Feed
  1. Then, we use grepl in a rowwise manner (implemented by Vectorized) to check if patterns in CLASS1 can be found in CLASS2 in the above merged table.

CodePudding user response:

We could use fuzzyjoin

library(fuzzyjoin)
regex_inner_join(HAVE2, HAVE1, by = c("NAME", "CLASS2" = "CLASS1")) %>% 
   select(NAME = NAME.x, CLASS1, CLASS2)
        NAME  CLASS1       CLASS2
1     Amelia Chicken      Chicken
2     Amelia  Shrimp  Some Shrimp
3     Amelia  Turkey     TurkeyPU
4        Ava   Horse Horse-Horsey
5        Ava     Pig     Pig-Farn
6        Ava   Sheep  Count-Sheep
7  Charlotte    Deer    Deer Head
8  Charlotte    Dove   Dove Count
9  Charlotte   Ducks  Ducks Count
10 Charlotte    Fish       Fishes
11      Emma    Crab  Crab- Count
12      Emma  Rabbit Rabbit-Count
13    Olivia     Cow     Cow-Type
14    Olivia    Goat    Goat-Feed
  • Related