What would be the best approach for grouping data according to a table of keywords in R-CodePudding

I have the following dictionary for grouping data

 1. [aa11, aa21, aa31, aa34],      "group A"
 2. [x23z, x22z, x32z, x35z, x34z],"group B"
 3. [lg32z, lg22z, lg84x, lg94y],  "group C"
 4. ...

The column in the data itself may also have more than one code. What I want is use the dictionary above and assign the groups to the data:

 1. [aa31, aa34], "group A"
 2. [lg94z],      "group C"
 3. [lg84x],      "group C"
 4. [x22z, x23z]  "group B"

What would be the best way to assign the records to the specific group described in the first table.

My thoughts so far: Should the data be first reshaped (tokenized) to have one code per record in both the "dictionary" table and the processed table?

CodePudding user response：

There are two scenarios:

Your code variable might contain several codes, but all from the same group.
Your code variable might contain several codes, but they could be from different groups.

Lookup input for both solutions

library(tidyverse)
library(tidytext)

lookup <-  data.frame(code  = c("[aa11, aa21, aa31, aa34]",
                                "[x23z, x22z, x32z, x35z, x34z]",
                                "[lg32z, lg22z, lg84x, lg94y]"),
                      group = c("group A", "group B", "group C"))

lookup_long <- lookup %>%
  unnest_tokens(words, code, "words")

Solution for 1.

You had a good idea to first tokenize the lookup and the input data and then do the match based on the word tokens. In case you have several codes in your code variable, I assume you just want to keep the information once, i.e. have just ONE column with the group information.

df_1 <- data.frame(code = c("[aa31, aa34]", "[lg94y]", "[lg84x]", "[x22z, x23z]"),
                   id   = 1:4)

df_1 %>%
  unnest_tokens(code_new, code, "words", drop = FALSE) %>%
  left_join(lookup_long, by = c("code_new" = "words")) %>%
  group_by(id) %>%
  distinct(group, .keep_all = TRUE) %>%
  ungroup() %>%
  select(-id, -code_new)

which gives:

# A tibble: 4 x 2
  code         group  
  <chr>        <chr>  
1 [aa31, aa34] group A
2 [lg94y]      group C
3 [lg84x]      group C
4 [x22z, x23z] group B

Solution for 2.

It's generally the same idea, but it needs a bit more reshaping to store the group information for all codes.

df_2 <- data.frame(code = c("[aa31, aa34]", "[lg94y]", "[lg84x]", "[x22z, x23z]", "[x22z, aa11]"),
                   id   = 1:5)

df_2 %>%
  unnest_tokens(code_new, code, "words", drop = FALSE) %>%
  left_join(lookup_long, by = c("code_new" = "words")) %>%
  group_by(id) %>%
  distinct(group, .keep_all = TRUE) %>%
  mutate(id_wide = 1:n()) %>%
  ungroup() %>%
  pivot_wider(values_from  = group,
              names_from   = id_wide,
              names_prefix = "group_",
              id_cols      = code)

which gives:

# A tibble: 5 x 3
  code         group_1 group_2
  <chr>        <chr>   <chr>  
1 [aa31, aa34] group A <NA>   
2 [lg94y]      group C <NA>   
3 [lg84x]      group C <NA>   
4 [x22z, x23z] group B <NA>   
5 [x22z, aa11] group B group A

CodePudding user response：

Tokenizing is probably the best approach. You could make a lookup table manually like this:

a <- gsub("]", "", lookup$pattern, fixed = TRUE)
a <- gsub("[", "", a, fixed = TRUE)
a <- gsub(" ", "", a, fixed = TRUE)
b <- strsplit(a, ",")
c <- Map(cbind, b, lookup$group)
d <- data.frame(do.call(rbind, c))
colnames(d) <- c("value", "group")
d
#>    value   group
#> 1   aa11 group A
#> 2   aa21 group A
#> 3   aa31 group A
#> 4   aa34 group A
#> 5   x23z group B
#> 6   x22z group B
#> 7   x32z group B
#> 8   x35z group B
#> 9   x34z group B
#> 10 lg32z group C
#> 11 lg22z group C
#> 12 lg84x group C
#> 13 lg94y group C

Or you could skip tokenizing and "fuzzy join":

a <- gsub(", ", "|", lookup$pattern, fixed = TRUE)
a <- gsub("[", "", a, fixed = TRUE)
a <- gsub("]", "", a, fixed = TRUE)
lookup2 <- cbind(lookup,a)
lookup2
#>                          pattern   group                        a
#> 1       [aa11, aa21, aa31, aa34] group A      aa11|aa21|aa31|aa34
#> 2 [x23z, x22z, x32z, x35z, x34z] group B x23z|x22z|x32z|x35z|x34z
#> 3   [lg32z, lg22z, lg84x, lg94y] group C  lg32z|lg22z|lg84x|lg94y

for(i in 1:nrow(lookup2)){
  df[grepl(pattern = lookup2$a[i], x = df$V1),"V3"] <- lookup2$group[i]
}
df
#>             V1      V2      V3
#> 1 [aa31, aa34] group A group A
#> 2      [lg94z] group C    <NA>
#> 3      [lg84x] group C group C
#> 4 [x22z, x23z] group B group B
Created on 2021-09-22 by the reprex package (v2.0.1)

Data:

lookup <- data.frame(
  pattern = c("[aa11, aa21, aa31, aa34]",
              "[x23z, x22z, x32z, x35z, x34z]",
              "[lg32z, lg22z, lg84x, lg94y]"),
  group = c("group A", "group B", "group C"))

df <- data.frame(
  V1 = c("[aa31, aa34]", "[lg94z]", "[lg84x]", "[x22z, x23z]"),
  V2 = c("group A", "group C", "group C", "group B"))

Note that there is a typo in your input data, so it returns NA at df[2,"V3"]. I am pretty sure the loop can be vectorized, I just can't think of how at the moment. I'll update if it comes to me.