Home > front end >  Merge two columns with different structures in R
Merge two columns with different structures in R

Time:12-14

How do I merge two tables that have different structures? Like, I want to cross the category names in "ds_categories" with the category id in "ds". ds has 1109383 lines and ds_categories 2775 lines. To be more specific, I want to link category names to category ids. This complete database is at kaggle: https://www.kaggle.com/sp1thas/book-depository-dataset/code

ds_categories

 category_id  category_name

1   1998    .Net Programming        
2   176     20th Century & Contemporary Classical Music     
3   3291    20th Century & Contemporary Classical Music     
4   2659    20th Century History: C 1900 To C 2000      
5   2661    21st Century History: From C 2000 -     
6   1992    2D Graphics: Games Programming

ds

authors bestsellers.rank categories
1   [1]     49848       [214, 220, 237, 2646, 2647, 2659, 2660, 2679]   
2   [2, 3]  115215      [235, 3386] 
3   [4]     11732       [358, 2630, 360, 2632]  
4   [5, 6, 7, 8]114379  [377, 2978, 2980]   
5   [9]      98413      [2813, 2980]    
6   [10, 11]    90674   [1520, 1532]

I tried this, but it didn't work:

join_cat <- merge(ds, ds_categories, by.x = "categories", by.y = "category_id", all.x = TRUE, all.y = FALSE)

CodePudding user response:

Joining is much easier after unnesting rows associated to multiple categories:

library(tidyverse)

# create some example data
ds_categories <- tribble(
  ~category_id,  ~category_name,
  1, "cat A",
  2, "cat B",
  3, "cat C"
)

ds <- tribble(
  ~authors, ~categories,
  c(1,2), c(1,2),
  3, 1,
  4, c(1,2,3)
)

ds %>%
  unnest(authors) %>%
  unnest(categories) %>%
  rename(category_id = categories) %>%
  left_join(ds_categories)
#> Joining, by = "category_id"
#> # A tibble: 8 x 3
#>   authors category_id category_name
#>     <dbl>       <dbl> <chr>        
#> 1       1           1 cat A        
#> 2       1           2 cat B        
#> 3       2           1 cat A        
#> 4       2           2 cat B        
#> 5       3           1 cat A        
#> 6       4           1 cat A        
#> 7       4           2 cat B        
#> 8       4           3 cat C

Created on 2021-12-13 by the reprex package (v2.0.1)

Try to always normalize your tables to 3NF aka making them tidy.

CodePudding user response:

You need to first perform some data cleaning, get the data for each categories in separate row and then perform the join.

library(dplyr)
library(tidyr)

ds %>%
  mutate(categories = gsub('\\[|\\]', '', categories)) %>%
  separate_rows(categories, sep = ',\\s*', convert = TRUE) %>%
  left_join(ds_categories, by = c('categories' = 'category_id'))

#   authors bestsellers.rank categories category_name                                      
#   <chr>              <dbl>      <int> <chr>                                              
# 1 [1]                49848        214 Biography: General                                 
# 2 [1]                49848        220 Biography: Historical, Political & Military        
# 3 [1]                49848        237 True War  & Combat Stories                         
# 4 [1]                49848       2646 Asian History                                      
# 5 [1]                49848       2647 Middle Eastern History                             
# 6 [1]                49848       2659 20th Century History: C 1900  To C 2000            
# 7 [1]                49848       2660 Postwar 20th Century History, From C 1945 To C 2000
# 8 [1]                49848       2679 Military History                                   
# 9 [2, 3]            115215        235 True Crime Biographies                             
#10 [2, 3]            115215       3386 True Crime Books                  

data

ds_categories <- read.csv('categories.csv')
ds <- data.frame(authors = c('[1]', '[2, 3]'), 
                      bestsellers.rank = c(49848, 115215), 
                      categories = c('[214, 220, 237, 2646, 2647, 2659, 2660, 2679]', 
                                     '[235, 3386]'))
  • Related