How to regroup elements with common values in two columns

Time:09-02

I would like to regroup elements if they share a common value in two different columns. What would be the best way?

Input:

tibble(a = c("C1", "C2", "C12", "C15", "C12"), b = c("C4", "C3", "C2", "C18", "C21")) 

Desired output:

list( c("C1", "C4"), c("C2", "C3", "C12", "C21"), c("C15", "C18"))

or

tibble(name = c("C1", "C2", "C12", "C15", "C4", "C3", "C18", "C21"), id_group= c(1, 2, 2, 3, 1, 2,3, 2))

CodePudding user response:

Here's a way using igraph (dat is the input tibble from the question):

library(igraph)
library(dplyr)

graph_from_data_frame(dat) |>
  components() |>
  getElement("membership") |>
  stack() |>
  arrange(values)

# short form:
# stack(components(graph_from_data_frame(dat))[[1]])

  values ind
1      1  C1
2      1  C4
3      2  C2
4      2 C12
5      2  C3
6      2 C21
7      3 C15
8      3 C18

or if you want to get a list rather than a data.frame:

g <- graph_from_data_frame(dat, directed = FALSE) |>
  components() |>
  getElement("membership") |>
  stack()

# stack() returns ind as a factor; convert it so the list holds character vectors
split(as.character(g$ind), f = g$values)
$`1`
[1] "C1" "C4"

$`2`
[1] "C2"  "C12" "C3"  "C21"

$`3`
[1] "C15" "C18"
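
If the two-column format from the question (name, id_group) is preferred, the membership vector can be reshaped directly. A minimal sketch, assuming the same input as dat (rebuilt here as a plain data.frame so the snippet is self-contained); the names membership and result are only illustrative:

```r
library(igraph)

# Same input as in the question, rebuilt as a data.frame (assumption: dat holds this data)
dat <- data.frame(a = c("C1", "C2", "C12", "C15", "C12"),
                  b = c("C4", "C3", "C2", "C18", "C21"))

# Named vector: names are the vertex names, values are the component ids
membership <- components(graph_from_data_frame(dat, directed = FALSE))$membership

# One row per element, with the id of the group it belongs to
result <- data.frame(name = names(membership),
                     id_group = as.integer(membership))
print(result)
```

Wrapping result in tibble::as_tibble() should give a tibble if one is needed.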

CodePudding user response:

Here is an algorithm that does it without any extra package. I'm not a tibble expert (I mostly use data.table), so you might want to rearrange things, but the output is what you're looking for.

The idea is to use each row's index as its initial group id, then iteratively relabel every group that shares an element with the current row so that it takes the current row's id. It can also be seen as a graph problem: knowing that C1 is linked to C4, C2 to C3, C2 to C12, etc., find the connected components of the graph (as done by @Maël).

#data
library(tibble)
dat = tibble(a = c("C1", "C2", "C12", "C15", "C12"), b = c("C4", "C3", "C2", "C18", "C21"))


#init the group id, according to your explanation, each row is a group
dat$group = 1:nrow(dat)

#create an array with the list of groups
group_list = dat$group

#for each row (group)
for (i in group_list){
  #retrieve elements in group i
  element_of_group_i = unlist(c(dat[i,'a'], dat[i,'b']))
  #for each element in group i, check if it is in another group
  for (el in element_of_group_i){
    #search for all rows in which the element 'el' of group i appears
    #in first column (which() returns every matching row, match() only the first one)
    other_row_with_el = which(dat[[1]] == el)
    #remove row i, which we already know contains the element 'el'
    other_row_with_el = other_row_with_el[other_row_with_el != i]
    
    #for each row that also contains 'el', change the id of its whole current group to i (the groups merge)
    for (row_id in other_row_with_el){
      dat$group[dat$group == dat$group[row_id]] = i
    }
    
    #in second column
    other_row_with_el = which(dat[[2]] == el)
    other_row_with_el = other_row_with_el[other_row_with_el != i]
    #for each row that also contains 'el', change the id of its whole current group to i (the groups merge)
    for (row_id in other_row_with_el){
      dat$group[dat$group == dat$group[row_id]] = i
    }
  }

}

#construct the desired output from dat
desired_output_1 = list()
index_list = 1
for (group_id in unique(dat$group)){
  row_group = dat[dat$group == group_id,]
  element_of_group = unique(unlist(row_group[,c("a","b")]))
  desired_output_1[[index_list]] = element_of_group
  index_list = index_list + 1
}

desired_output_2 = tibble(name  = unique(unlist(dat[,c("a","b")])), id_group=NA )

for (group_id in unique(dat$group)){
  row_group = dat[dat$group == group_id,]
  element_of_group = unique(unlist(row_group[,c("a","b")]))
  desired_output_2$id_group[ desired_output_2$name %in% element_of_group] = group_id
 
}

Result in 1st format:

print(desired_output_1)
[[1]]
[1] "C1" "C4"

[[2]]
[1] "C2"  "C12" "C3"  "C21"

[[3]]
[1] "C15" "C18"

Result in 2nd format:

print(desired_output_2)
# A tibble: 8 x 2
  name  id_group
  <chr>    <int>
1 C1           1
2 C2           5
3 C12          5
4 C15          4
5 C4           1
6 C3           5
7 C18          4
8 C21          5

We could rename the group ids to be 1, 2, 3. I believe what makes this problem harder and longer to write than it sounds is the shape of the initial dataset. Working with a list from the beginning could be easier.
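
Both remarks can be sketched in base R. This is only an illustration, not part of the solution above: the id vector is copied from the printed result, and the names id_renumbered, pairs and groups are hypothetical.

```r
# Group ids copied from the printed result above
id_group <- c(1L, 5L, 5L, 4L, 1L, 5L, 4L, 5L)

# Renumber the ids 1, 2, 3, ... in order of first appearance
id_renumbered <- match(id_group, unique(id_group))
print(id_renumbered)

# List-based variant: start from one character vector per row of the input
pairs <- list(c("C1", "C4"), c("C2", "C3"), c("C12", "C2"),
              c("C15", "C18"), c("C12", "C21"))

groups <- list()
for (p in pairs) {
  # Which existing groups share at least one element with this pair?
  hit <- vapply(groups, function(g) any(p %in% g), logical(1))
  # Merge those groups with the pair and keep the untouched groups as they are
  merged <- unique(c(unlist(groups[hit]), p))
  groups <- c(groups[!hit], list(merged))
}
print(groups)
```

The order of the groups in the list depends on the order in which the pairs are merged, so it may differ from the order shown earlier.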

I hope this helps
