How do I merge two tables that have different structures? Like, I want to cross the category names in "ds_categories" with the category id in "ds". ds has 1109383 lines and ds_categories 2775 lines. To be more specific, I want to link category names to category ids. This complete database is at kaggle: https://www.kaggle.com/sp1thas/book-depository-dataset/code
ds_categories
category_id category_name
1 1998 .Net Programming
2 176 20th Century & Contemporary Classical Music
3 3291 20th Century & Contemporary Classical Music
4 2659 20th Century History: C 1900 To C 2000
5 2661 21st Century History: From C 2000 -
6 1992 2D Graphics: Games Programming
ds
authors bestsellers.rank categories
1 [1] 49848 [214, 220, 237, 2646, 2647, 2659, 2660, 2679]
2 [2, 3] 115215 [235, 3386]
3 [4] 11732 [358, 2630, 360, 2632]
4 [5, 6, 7, 8]114379 [377, 2978, 2980]
5 [9] 98413 [2813, 2980]
6 [10, 11] 90674 [1520, 1532]
I tried this, but it didn't work:
join_cat <- merge(ds, ds_categories, by.x = "categories", by.y = "category_id", all.x = TRUE, all.y = FALSE)
CodePudding user response:
Joining is much easier after unnesting rows associated to multiple categories:
library(tidyverse)
# create some example data
ds_categories <- tribble(
~category_id, ~category_name,
1, "cat A",
2, "cat B",
3, "cat C"
)
ds <- tribble(
~authors, ~categories,
c(1,2), c(1,2),
3, 1,
4, c(1,2,3)
)
ds %>%
unnest(authors) %>%
unnest(categories) %>%
rename(category_id = categories) %>%
left_join(ds_categories)
#> Joining, by = "category_id"
#> # A tibble: 8 x 3
#> authors category_id category_name
#> <dbl> <dbl> <chr>
#> 1 1 1 cat A
#> 2 1 2 cat B
#> 3 2 1 cat A
#> 4 2 2 cat B
#> 5 3 1 cat A
#> 6 4 1 cat A
#> 7 4 2 cat B
#> 8 4 3 cat C
Created on 2021-12-13 by the reprex package (v2.0.1)
Try to always normalize your tables to 3NF aka making them tidy.
CodePudding user response:
You need to first perform some data cleaning, get the data for each categories
in separate row and then perform the join.
library(dplyr)
library(tidyr)
ds %>%
mutate(categories = gsub('\\[|\\]', '', categories)) %>%
separate_rows(categories, sep = ',\\s*', convert = TRUE) %>%
left_join(ds_categories, by = c('categories' = 'category_id'))
# authors bestsellers.rank categories category_name
# <chr> <dbl> <int> <chr>
# 1 [1] 49848 214 Biography: General
# 2 [1] 49848 220 Biography: Historical, Political & Military
# 3 [1] 49848 237 True War & Combat Stories
# 4 [1] 49848 2646 Asian History
# 5 [1] 49848 2647 Middle Eastern History
# 6 [1] 49848 2659 20th Century History: C 1900 To C 2000
# 7 [1] 49848 2660 Postwar 20th Century History, From C 1945 To C 2000
# 8 [1] 49848 2679 Military History
# 9 [2, 3] 115215 235 True Crime Biographies
#10 [2, 3] 115215 3386 True Crime Books
data
ds_categories <- read.csv('categories.csv')
ds <- data.frame(authors = c('[1]', '[2, 3]'),
bestsellers.rank = c(49848, 115215),
categories = c('[214, 220, 237, 2646, 2647, 2659, 2660, 2679]',
'[235, 3386]'))