Fix ID combination column to include all possible combinations based on other column

I have a large dataframe with IDs that are unique, but many of them still refer to the same object. I therefore need to create a dataframe that contains two columns: one with the unique IDs, and one with all IDs that refer to the same underlying object. Rows refer to the same object when they are linked together, directly or indirectly, through the second column.

I already have an ID combination column, but some combinations are missing, which is what I am trying to address.

Roughly, the data now looks like this:

> tibble(id = c("x", "y", "z", "q", "w", "p"), t = c("x; y", "x; y; z", "y; z", "q", "w; p", "p")) 
# A tibble: 6 × 2
  id    t      
  <chr> <chr>  
1 x     x; y   
2 y     x; y; z
3 z     y; z   
4 q     q      
5 w     w; p   
6 p     p

And, what I should end up with is the following.

> tibble(id = c("x", "y", "z", "q", "w", "p"), t = c("x; y; z", "x; y; z", "x; y; z", "q", "w; p", "w; p"))
# A tibble: 6 × 2
  id    t      
  <chr> <chr>  
1 x     x; y; z
2 y     x; y; z
3 z     x; y; z
4 q     q      
5 w     w; p   
6 p     w; p

I have tried various forms of collapsing the strings, but I can't link the IDs that are not already linked together through the second column, i.e., in this case the IDs y and z.

Hope it makes sense; any help is appreciated :-)!

EDIT: Basically, I want to loop through all values of id, check whether each one has a (partial string) match in t, and then return both id and t, potentially pasted together, so that I can take the unique values in t afterwards.
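One caveat about the partial string match idea: a plain substring search can also hit longer IDs that merely contain the one you are looking for. The sketch below uses a made-up ID "xy" (not in the data above) to show the difference between a substring match and matching whole IDs after splitting on "; ".

```r
# Hypothetical combination strings; "xy" is an invented ID for illustration.
t_col <- c("x; y", "xy; z")

# Substring match: "x" is also found inside "xy".
substring_hits <- grepl("x", t_col, fixed = TRUE)

# Whole-ID match: split each string into IDs first, then test membership.
tokens <- strsplit(t_col, "; ", fixed = TRUE)
token_hits <- vapply(tokens, function(tok) "x" %in% tok, logical(1))

substring_hits  # TRUE TRUE
token_hits      # TRUE FALSE
```

If your real IDs can be substrings of each other, splitting first is probably the safer route.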

CodePudding user response：

Here is a base R method:

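# `dat` is assumed to be the tibble from the question.
# Split each combination string into a vector of individual IDs.
combos <- strsplit(dat$t, "; ")

# For each id, pool every combination that contains it, then de-duplicate.
dat$t <- lapply(dat$id, \(id) {
    combos[sapply(combos, \(combo) id %in% combo)] |>
        unlist() |>
        unique() |>
        paste(collapse = "; ")
}) |>
    unlist()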

dat
# # A tibble: 6 x 2
#   id    t        
#   <chr> <chr>    
# 1 x     x; y; z  
# 2 y     x; y; z  
# 3 z     x; y; z  
# 4 q     q        
# 5 w     w; p
# 6 p     w; p
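
Note that the code above pools only the combinations that directly contain each id, which is enough for the example data. If your real data has longer chains (say a links to b, b to c, c to d), one pass will not reach everything. A minimal sketch of one way to handle that, assuming "; " is always the separator, is to repeat the pooling until no group grows any more. The helper name merge_linked is made up for this example.

```r
# Example data from the question, as a plain data.frame (base R only).
dat <- data.frame(
  id = c("x", "y", "z", "q", "w", "p"),
  t  = c("x; y", "x; y; z", "y; z", "q", "w; p", "p")
)

# Hypothetical helper: keep pooling overlapping groups until nothing changes.
merge_linked <- function(id, t, sep = "; ") {
  # Start from each row's own combination, making sure the id itself is in it.
  groups <- Map(union, id, strsplit(t, sep, fixed = TRUE))
  repeat {
    merged <- lapply(groups, function(g) {
      # Pool every group that shares at least one member with g.
      hits <- vapply(groups, function(h) any(h %in% g), logical(1))
      unique(unlist(groups[hits]))
    })
    # Groups can only grow, so equal sizes mean a fixed point was reached.
    done <- identical(lengths(merged), lengths(groups))
    groups <- merged
    if (done) break
  }
  vapply(groups, function(g) paste(g, collapse = sep), character(1),
         USE.NAMES = FALSE)
}

dat$t <- merge_linked(dat$id, dat$t)

# A chain where a single pass would stop at "a; b" for id "a".
chain <- merge_linked(c("a", "b", "c", "d"), c("a; b", "b; c", "c; d", "d"))
```

This is quadratic in the number of rows per pass, so for a really large dataframe a graph-based approach is likely the better choice.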

CodePudding user response：

A tidyverse / igraph method:

library(tidyverse)
library(igraph)

# `tib` is assumed to be the tibble from the question.
tib %>% 
  separate_rows(t) %>% 
  inner_join(., ., by = "id") %>% 
  select(from = t.x, to = t.y, id) %>% 
  graph_from_data_frame(directed = FALSE) %>% 
  components() %>% 
  pluck(membership) %>% 
  mutate(.data = tib, 
         m = ave(names(.), ., FUN = \(x) paste(x, collapse = "; "))) %>% 
  select(id, m)

# A tibble: 6 × 2
  id    m      
  <chr> <chr>  
1 x     x; y; z
2 y     x; y; z
3 z     x; y; z
4 q     q      
5 w     w; p   
6 p     w; p
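
The pipeline above lines the component membership up with the rows of tib by position, which works here because the vertices happen to come out in the same order as id. A rough variant that looks the membership up by vertex name instead, so it should not depend on that ordering, could look like this. It is only a sketch: it assumes "; " as the separator, and the intermediate names (tokens, edges, memb) are made up for the example.

```r
library(igraph)

# Example data from the question, as a plain data.frame.
tib <- data.frame(
  id = c("x", "y", "z", "q", "w", "p"),
  t  = c("x; y", "x; y; z", "y; z", "q", "w; p", "p")
)

# One edge per (id, listed ID) pair; self-loops such as x-x are harmless.
tokens <- strsplit(tib$t, "; ", fixed = TRUE)
edges <- data.frame(
  from = rep(tib$id, lengths(tokens)),
  to   = unlist(tokens)
)

# Connected components of the undirected graph; membership is named by vertex.
g <- graph_from_data_frame(edges, directed = FALSE)
memb <- components(g)$membership

# Look each id up by name, then paste together the ids of each component.
tib$m <- ave(tib$id, memb[tib$id],
             FUN = function(x) paste(x, collapse = "; "))

tib[, c("id", "m")]
```

Because connected components are transitive by definition, this also covers chains of any length.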