Home > Net >  Merge where some ids are concatenated in single column
Merge where some ids are concatenated in single column

Time:10-04

I have a dataframe with a column of ids, but for some rows there are multiple ids concatenated together. I want to merge this onto another dataframe using the id, and when the ids are concatenated it handles that and reflects it by having the values in the new columns added also concatenated.

For example I have dataframes

data <- data.frame(
    id = c(1, 4, 3, "2,3", "1,4"),
    value = c(1:5)
)
> data
   id value
1   1     1
2   4     2
3   3     3
4 2,3     4
5 1,4     5

mapping <- data.frame(
    id = 1:4,
    name = c("one", "two", "three", "four")
)
> mapping
  id  name
1  1   one
2  2   two
3  3 three
4  4  four

I would like to end up with

   id value      name
1   1     1       one
2   4     2      four
3   3     3     three
4 2,3     4 two,three
5 1,4     5  one,four

CodePudding user response:

I don't think there's a good way to do this other than to separate, join, and re-concatenate:

library(dplyr)
library(tidyr)
data %>%
  mutate(true_id = row_number()) %>%
  separate_rows(id, convert = TRUE) %>%
  left_join(mapping, by = "id") %>%
  group_by(true_id, value) %>%
  summarize(id = toString(id), name = toString(name), .groups = "drop")
# # A tibble: 5 × 4
#   true_id value id    name      
#     <int> <int> <chr> <chr>     
# 1       1     1 1     one       
# 2       2     2 4     four      
# 3       3     3 3     three     
# 4       4     4 2, 3  two, three
# 5       5     5 1, 4  one, four 

I wasn't sure if your value column would actually be unique, so I added a true_id just in case.

CodePudding user response:

What about something like this. I could think of a few ways. One is longer, but much easier to follow and the other is short, but kind of a mess.

library(tidyverse)


#long and readable
data |>
  mutate(tmp = row_number()) |>
  mutate(id = str_split(id, ",")) |>
  unnest_longer(id) |>
  left_join(mapping |>
              mutate(id = as.character(id)), by = "id") |>
  group_by(tmp) |>
  summarise(id = paste(id, collapse = ","),
            value = value[1],
            name = paste(name, collapse = ","))
#> # A tibble: 5 x 4
#>     tmp id    value name     
#>   <int> <chr> <int> <chr>    
#> 1     1 1         1 one      
#> 2     2 4         2 four     
#> 3     3 3         3 three    
#> 4     4 2,3       4 two,three
#> 5     5 1,4       5 one,four


#short and ugly
data |>
  mutate(name = map_chr(id, \(x)paste(
    mapping$name[which(as.character(mapping$id) %in% str_split(x, ",")[[1]])], 
    collapse = ",") ))
#>    id value      name
#> 1   1     1       one
#> 2   4     2      four
#> 3   3     3     three
#> 4 2,3     4 two,three
#> 5 1,4     5  one,four

CodePudding user response:

greping the data$ids out of the mapping$ids.

mapply(\(x, y) toString(mapping$name[grep(sprintf('[%s]', gsub('\\D', '', x)), y)]), 
       data$id, list(mapping$id))
#     1            4            3          2,3          1,4 
# "one"       "four"      "three" "two, three"  "one, four" 

In order not to have a space after the comma, use paste(., collapse=',') instead of toString.

  • Related