I am trying to find a solution for below problem but unable to do so, Please help me with the logic, I am okay with a direction also, I can code but unable to came up with necessary logic.
Here is the problem:
I have a dataset like this:
df <- data.frame(x = c(1, 1, 2, 2, 2, 2, 3, 4, 5, 1, 2,3, 4, 7, 8, 9, 4, 10, 10, 11, 12), y = c('a', 'b', 'c','d', 'e', 'f', 'a', 'a', 'e', 'k', 'l', 'm', 'n', 'b', 'e', 'e', 'b', 'x', 'y', 'z', 'z' ))
The expected output should be:
col1 col2
1 1,4,3, 7
2 2,5,8,9
10 10
11 11, 12
Logic : consider x column as a key, y as value, now if a key is repeated with different values of value(y) then all these keys are connected,as an example: 1 is connected with a and b, k, but a is also connected to 4,3, 7. Now 2 is connected to c d e and f, so 2, 5, 8, 9 are also connected similarly z is connected with 11 and 12. To made the final connection we can concatenate them together and take out the first key as key rest as comma separated values.
My attempt
library(data.table)
setDT(df)
setnames(df, c('x', 'y'),c('los', 'mob'))
dfLos <- df[, .(mobconcat = paste0(mob, collapse = ',')), .(los)]
dfMob <- df[, .(losconcat = paste0(los, collapse = ',')), .(mob)]
df <- merge(df, dfMob, by='mob', all.x=TRUE)
dim(df)
df <- merge(df, dfLos, by='los', all.x=TRUE)
(showing only few rows)
los mob losconcat mobconcat
1: 1 a 1,3,4 a,b,k
2: 1 b 1,7,4 a,b,k
3: 1 k 1 a,b,k
4: 2 c 2 c,d,e,f,l
5: 2 d 2 c,d,e,f,l
6: 2 e 2,5,8,9 c,d,e,f,l
7: 2 f 2 c,d,e,f,l
8: 2 l 2 c,d,e,f,l
Now I am stuck at here, may be this approach is waste, the approach I am trying here to get all the keys against all the keys with comma separated but unable to understand how to make it further.
Thank you very much. You are welcome to suggest any other approach or expand on mine, I am okay with anyone.
CodePudding user response:
Another not very beautiful solution, but it's data.table based. Maybe it helps.
library(data.table)
library(stringr)
df = data.table(x = c(1, 1, 2, 2, 2, 2, 3, 4, 5, 1, 2,3, 4, 7, 8, 9, 4, 10, 10, 11, 12), y = c('a', 'b', 'c','d', 'e', 'f', 'a', 'a', 'e', 'k', 'l', 'm', 'n', 'b', 'e', 'e', 'b', 'x', 'y', 'z', 'z' ))
df2 = df[, .(conc = str_c(x, collapse = ", ")), by=y]
df3 = merge(df, df2, by="y", all=TRUE)
df4 = unique(df3[, .(conc = unlist(strsplit(str_c(conc, collapse = ", "), ", "))), by=x][order(conc)], by="conc")
df_final = [, .(conc = str_c(conc, collapse = ", ")), by=x][order(x)]
Output:
x conc
1: 1 1, 3, 4, 7
2: 2 2, 5, 8, 9
3: 10 10
4: 11 11, 12
CodePudding user response:
Here is a function that should do what you want. It may not be the fastest or most elegant answer...
key_connect <- function(data, key, val) {
val_by_key <- split(data[[val]], data[[key]])
key_by_val <- split(data[[key]], data[[val]])
f <- function(val) unique(unlist(key_by_val[val]))
res <- data.frame(
## Unique keys
key = as.integer(names(val_by_key)),
## Connected keys with duplicates
key_con = I(lapply(val_by_key, f))
)
ul <- unlist(res[["key_con"]], use.names = FALSE)
ends <- cumsum(lengths(res[["key_con"]]))
ends <- c(0L, ends[-length(ends)])
g <- function(i, end) i[match(i, ul[seq_len(end)], 0L) == 0L]
## Connected keys without duplicates
res[["key_con"]] <- Map(g, i = res[["key_con"]], end = ends)
res <- res[lengths(res[["key_con"]]) > 0L, ]
row.names(res) <- NULL
res
}
key_connect(df, key = "x", val = "y")
key key_con
1 1 1, 3, 4, 7
2 2 2, 5, 8, 9
3 10 10
4 11 11, 12