Finding the unique key pair with same values-CodePudding

I am trying to find a solution for below problem but unable to do so, Please help me with the logic, I am okay with a direction also, I can code but unable to came up with necessary logic.

Here is the problem:

I have a dataset like this:

df <- data.frame(x = c(1, 1, 2, 2, 2, 2, 3, 4, 5, 1, 2,3, 4, 7, 8, 9, 4, 10, 10, 11, 12), y = c('a', 'b', 'c','d', 'e', 'f', 'a', 'a', 'e', 'k', 'l', 'm', 'n', 'b', 'e', 'e', 'b',  'x', 'y', 'z', 'z'  ))

The expected output should be:

  col1  col2
  1     1,4,3, 7
  2     2,5,8,9
  10    10
  11    11, 12

Logic : consider x column as a key, y as value, now if a key is repeated with different values of value(y) then all these keys are connected,as an example: 1 is connected with a and b, k, but a is also connected to 4,3, 7. Now 2 is connected to c d e and f, so 2, 5, 8, 9 are also connected similarly z is connected with 11 and 12. To made the final connection we can concatenate them together and take out the first key as key rest as comma separated values.

My attempt

library(data.table)
setDT(df)
setnames(df, c('x', 'y'),c('los', 'mob'))

dfLos <- df[, .(mobconcat = paste0(mob, collapse = ',')), .(los)]
dfMob <- df[, .(losconcat = paste0(los, collapse = ',')), .(mob)]

df <- merge(df, dfMob, by='mob', all.x=TRUE)
dim(df)
df <- merge(df, dfLos, by='los', all.x=TRUE)
(showing  only few rows)
   los mob losconcat mobconcat
 1:   1   a     1,3,4     a,b,k
 2:   1   b     1,7,4     a,b,k
 3:   1   k         1     a,b,k
 4:   2   c         2 c,d,e,f,l
 5:   2   d         2 c,d,e,f,l
 6:   2   e   2,5,8,9 c,d,e,f,l
 7:   2   f         2 c,d,e,f,l
 8:   2   l         2 c,d,e,f,l

Now I am stuck at here, may be this approach is waste, the approach I am trying here to get all the keys against all the keys with comma separated but unable to understand how to make it further.

Thank you very much. You are welcome to suggest any other approach or expand on mine, I am okay with anyone.

CodePudding user response：

Another not very beautiful solution, but it's data.table based. Maybe it helps.

library(data.table)
library(stringr)

df = data.table(x = c(1, 1, 2, 2, 2, 2, 3, 4, 5, 1, 2,3, 4, 7, 8, 9, 4, 10, 10, 11, 12), y = c('a', 'b', 'c','d', 'e', 'f', 'a', 'a', 'e', 'k', 'l', 'm', 'n', 'b', 'e', 'e', 'b',  'x', 'y', 'z', 'z'  ))


df2 = df[, .(conc = str_c(x, collapse = ", ")), by=y]
df3 = merge(df, df2, by="y", all=TRUE)
df4 = unique(df3[, .(conc = unlist(strsplit(str_c(conc, collapse = ", "), ", "))), by=x][order(conc)], by="conc")
df_final = [, .(conc = str_c(conc, collapse = ", ")), by=x][order(x)]

Output:

    x       conc
1:  1 1, 3, 4, 7
2:  2 2, 5, 8, 9
3: 10         10
4: 11     11, 12

CodePudding user response：

Here is a function that should do what you want. It may not be the fastest or most elegant answer...

key_connect <- function(data, key, val) {
  val_by_key <- split(data[[val]], data[[key]])
  key_by_val <- split(data[[key]], data[[val]])
  f <- function(val) unique(unlist(key_by_val[val]))
  res <- data.frame(
    ## Unique keys
    key = as.integer(names(val_by_key)),
    ## Connected keys with duplicates
    key_con = I(lapply(val_by_key, f))
  )
  ul <- unlist(res[["key_con"]], use.names = FALSE)
  ends <- cumsum(lengths(res[["key_con"]]))
  ends <- c(0L, ends[-length(ends)])
  g <- function(i, end) i[match(i, ul[seq_len(end)], 0L) == 0L]
  ## Connected keys without duplicates
  res[["key_con"]] <- Map(g, i = res[["key_con"]], end = ends)
  res <- res[lengths(res[["key_con"]]) > 0L, ]
  row.names(res) <- NULL
  res
}

key_connect(df, key = "x", val = "y")

  key    key_con
1   1 1, 3, 4, 7
2   2 2, 5, 8, 9
3  10         10
4  11     11, 12