Is there an option to left_join a list-column?-CodePudding

Suppose I have a data set with sales:

library(tidyverse)
initial_data <- tribble(
  ~ billing_loc, ~ type, ~ billing_comment,
  "New York", "RE", "aaaa ssss 003tt",
  "London", "ZO", "BO",
  "Paris", "ZO", "003sd, 003Wf; 003ghdiscount"
)

First of all, I want to extract all the project ids which are provided in the billing_comment column. A project id always starts with "003" and has a length of 5. I did it with the following code:

modified_data <- initial_data %>% 
  mutate(project_id = str_extract_all(billing_comment, "003.."))

Secondly, I want to use left_join to look up for the information about the ids from the second table and insert it in the list-column:

project_id_database <- tribble(
  ~ project_id, ~ type, ~ owner,
  "003tt", "ZO", "Juan",
  "003sd", "ZO", "Mike",
  "003aA", "RE", "Brent",
  "003Wf", "ZO", "Brent",
  "003gh", "RE", "Anna",
  "003qQ", "ZO", "Donald"
)

Is there a way to use left_join for nested data without unnesting it and get a list-column containing tibbles with all information about those ids (don't be confused with <lgl [1] in the second row, it can either return NA or some string saying that the id is missing)?

  billing_loc type  billing_comment             project_data    
  <chr>       <chr> <chr>                       <list>          
1 New York    RE    aaaa ssss 003tt             <tibble [1 x 3]>
2 London      ZO    BO                          <lgl [1]>       
3 Paris       ZO    003sd, 003Wf; 003ghdiscount <tibble [3 x 3]>

I found a way to do it with unnest() and then use left_join but I think there should be a more effective solution. Finally, it would be nice if there were a solution to add a column with only one id that matches a criteria in column "type" (it several ids matches the criteria it should return the first). For that purpose I used map() but I also think this way is not as effective as it could be because I used the column "status" for that:

test_data_nest %>% 
  mutate(final_project = map(data, ~ filter(., status == "Match"))) %>% 
  mutate(final_project = map(final_opp, 1))

CodePudding user response：

I think separating the rows, extracting the project_ids, attaching the details, nesting, and then joining back would be simpler and faster than trying to map:

initial_data %>%
  separate_rows(billing_comment) %>%
  mutate(project_id = str_extract(billing_comment, "003..")) %>%
  inner_join(project_id_database %>% select(-type), by="project_id") %>%
  group_by(billing_loc, type) %>%
  nest() %>%
  right_join(initial_data, by=c("billing_loc", "type"))

## A tibble: 3 x 4
## Groups:   billing_loc, type [3]
#  billing_loc type  data             billing_comment            
#  <chr>       <chr> <list>           <chr>                      
#1 New York    RE    <tibble [1 x 3]> aaaa ssss 003tt            
#2 Paris       ZO    <tibble [3 x 3]> 003sd, 003Wf; 003ghdiscount
#3 London      ZO    <NULL>           BO

CodePudding user response：

If you want to do this without unnesting the data in long format you can use -

library(tidyverse)

initial_data %>% 
  mutate(project_id = str_extract_all(billing_comment, "003.."), 
         data = map(project_id, 
             ~project_id_database[match(.x, project_id_database$project_id), ])) 

# A tibble: 3 x 5
#  billing_loc type  billing_comment             project_id data            
#  <chr>       <chr> <chr>                       <list>     <list>          
#1 New York    RE    aaaa ssss 003tt             <chr [1]>  <tibble [1 × 3]>
#2 London      ZO    BO                          <chr [0]>  <tibble [0 × 3]>
#3 Paris       ZO    003sd, 003Wf; 003ghdiscount <chr [3]>  <tibble [3 × 3]>