Merge two columns with different structures in R-CodePudding

How do I merge two tables that have different structures? Like, I want to cross the category names in "ds_categories" with the category id in "ds". ds has 1109383 lines and ds_categories 2775 lines. To be more specific, I want to link category names to category ids. This complete database is at kaggle: https://www.kaggle.com/sp1thas/book-depository-dataset/code

ds_categories

 category_id  category_name

1   1998    .Net Programming        
2   176     20th Century & Contemporary Classical Music     
3   3291    20th Century & Contemporary Classical Music     
4   2659    20th Century History: C 1900 To C 2000      
5   2661    21st Century History: From C 2000 -     
6   1992    2D Graphics: Games Programming

authors bestsellers.rank categories
1   [1]     49848       [214, 220, 237, 2646, 2647, 2659, 2660, 2679]   
2   [2, 3]  115215      [235, 3386] 
3   [4]     11732       [358, 2630, 360, 2632]  
4   [5, 6, 7, 8]114379  [377, 2978, 2980]   
5   [9]      98413      [2813, 2980]    
6   [10, 11]    90674   [1520, 1532]

I tried this, but it didn't work:

join_cat <- merge(ds, ds_categories, by.x = "categories", by.y = "category_id", all.x = TRUE, all.y = FALSE)

CodePudding user response：

Joining is much easier after unnesting rows associated to multiple categories:

library(tidyverse)

# create some example data
ds_categories <- tribble(
  ~category_id,  ~category_name,
  1, "cat A",
  2, "cat B",
  3, "cat C"
)

ds <- tribble(
  ~authors, ~categories,
  c(1,2), c(1,2),
  3, 1,
  4, c(1,2,3)
)

ds %>%
  unnest(authors) %>%
  unnest(categories) %>%
  rename(category_id = categories) %>%
  left_join(ds_categories)
#> Joining, by = "category_id"
#> # A tibble: 8 x 3
#>   authors category_id category_name
#>     <dbl>       <dbl> <chr>        
#> 1       1           1 cat A        
#> 2       1           2 cat B        
#> 3       2           1 cat A        
#> 4       2           2 cat B        
#> 5       3           1 cat A        
#> 6       4           1 cat A        
#> 7       4           2 cat B        
#> 8       4           3 cat C

^{Created on 2021-12-13 by the reprex package (v2.0.1)}

Try to always normalize your tables to 3NF aka making them tidy.

CodePudding user response：

You need to first perform some data cleaning, get the data for each categories in separate row and then perform the join.

library(dplyr)
library(tidyr)

ds %>%
  mutate(categories = gsub('\\[|\\]', '', categories)) %>%
  separate_rows(categories, sep = ',\\s*', convert = TRUE) %>%
  left_join(ds_categories, by = c('categories' = 'category_id'))

#   authors bestsellers.rank categories category_name                                      
#   <chr>              <dbl>      <int> <chr>                                              
# 1 [1]                49848        214 Biography: General                                 
# 2 [1]                49848        220 Biography: Historical, Political & Military        
# 3 [1]                49848        237 True War  & Combat Stories                         
# 4 [1]                49848       2646 Asian History                                      
# 5 [1]                49848       2647 Middle Eastern History                             
# 6 [1]                49848       2659 20th Century History: C 1900  To C 2000            
# 7 [1]                49848       2660 Postwar 20th Century History, From C 1945 To C 2000
# 8 [1]                49848       2679 Military History                                   
# 9 [2, 3]            115215        235 True Crime Biographies                             
#10 [2, 3]            115215       3386 True Crime Books

data

ds_categories <- read.csv('categories.csv')
ds <- data.frame(authors = c('[1]', '[2, 3]'), 
                      bestsellers.rank = c(49848, 115215), 
                      categories = c('[214, 220, 237, 2646, 2647, 2659, 2660, 2679]', 
                                     '[235, 3386]'))