Home > Back-end >  Mutate column based on list of lists in R
Mutate column based on list of lists in R

Time:12-16

I have a dataframe that I want to gather so that it is in tall format, and then mutate on another column with values based on membership of a string from another column in a list of lists. For example, I have the following data frame and list of lists:

dummy_data <- data.frame("id" = 1:20,"test1_10" = sample(1:100, 20),"test2_11" = sample(1:100, 20),
"test3_12" = sample(1:100, 20),"check1_20" = sample(1:100, 20),
"check2_21" = sample(1:100, 20),"sound1_30" = sample(1:100, 20),
"sound2_31" = sample(1:100, 20),"sound3_32" = sample(1:100, 20))

dummylist <- list(c('test1_','test2_','test3_'),c('check1_','check2_'),c('sound1_','sound2_','sound3_'))
names(dummylist) <- c('shipments','arrivals','departures')

And then I gather the data frame like so:

dummy_data <- dummy_data %>%
  gather("part", "number", 2:ncol(.))

What I want to do is add a column that has the name of the list found in dummylist where the string before the underscore in the part column is a member. And I can do that like this:

dummydata <- dummydata %>%
mutate(Group = case_when(
    str_extract(part,'.*_') %in% dummylist[[1]] ~ names(dummylist[1]),
    str_extract(part,'.*_') %in% dummylist[[2]] ~ names(dummylist[2]),
    str_extract(part,'.*_') %in% dummylist[[3]] ~ names(dummylist[3])
  ))

However, this requires a separate str_extract line for each list/group within the dummylist. And my real data has way more than 3 lists/groups. So I'm wondering if there is a more efficient way to do this mutate step to get the names of the lists in?

Any help is much appreciated, thanks!

CodePudding user response:

It may be easier with a regex_left_join after converting the 'dummylist' to a two column dataset

library(fuzzyjoin)
library(dplyr)
library(tidyr)
library(tibble)
dummy_data %>% 
   # // reshape to long format - pivot_longer instead of gather
   pivot_longer(cols = -id, names_to = 'part', values_to = 'number') %>% 
   # // join with the tibble/data.frame converted dummylist
   regex_left_join(dummylist %>%
        enframe(name = 'Group', value = 'part') %>% 
        unnest(part)) %>% 
   rename(part = part.x) %>%
   select(-part.y)

-output

# A tibble: 160 × 4
      id part      number Group     
   <int> <chr>      <int> <chr>     
 1     1 test1_10      72 shipments 
 2     1 test2_11      62 shipments 
 3     1 test3_12      17 shipments 
 4     1 check1_20     89 arrivals  
 5     1 check2_21     54 arrivals  
 6     1 sound1_30     39 departures
 7     1 sound2_31     94 departures
 8     1 sound3_32     95 departures
 9     2 test1_10      77 shipments 
10     2 test2_11       4 shipments 
# … with 150 more rows

CodePudding user response:

If you prepare your lookup table beforehand, you don't need any extra libraries, but dplyr and tidyr:

lookup <- sapply(
    names(dummylist),
    \(nm) { setNames(rep(nm, length(dummylist[[nm]])), dummylist[[nm]]) }
    ) |>
    setNames(nm = NULL) |>
    unlist()    

lookup

#      test1_       test2_       test3_      check1_      check2_      sound1_      sound2_      sound3_ 
# "shipments"  "shipments"  "shipments"   "arrivals"   "arrivals" "departures" "departures" "departures" 

Now you just gsubing on the fly, and translating your parts, within usual mutate() verb:

dummy_data |>
    pivot_longer(-id, names_to = 'part', values_to = 'number') |>
    mutate(group = lookup[gsub('^(\\w _).*$', '\\1', part)])
    
# # A tibble: 160 × 4
#      id part      number group     
#   <int> <chr>      <int> <chr>     
# 1     1 test1_10      91 shipments 
# 2     1 test2_11      74 shipments 
# 3     1 test3_12      46 shipments 
# 4     1 check1_20     62 arrivals  
# 5     1 check2_21      7 arrivals  
# 6     1 sound1_30     35 departures
# 7     1 sound2_31     23 departures
# 8     1 sound3_32     84 departures
# 9     2 test1_10      59 shipments 
# 10    2 test2_11      73 shipments 
# # … with 150 more rows
  • Related