Fill in a matrix based on whether the index rows are similar or different

I have a very large pairwise distance matrix in R. I'd like to code each cell in the matrix based on whether the row/column names are the same or different.

On a smaller scale, the row/column names would be:

individuals <- c("apple", "pear", "apple", "cranberry", "peach", "apple")

I would like a matrix with 1 for each comparison involving apple, except for comparisons of apple to apple. That would look like:

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] "0"  "1"  "1"  "1"  "1"  "1" 
[2,] "1"  "0"  "1"  "0"  "0"  "1" 
[3,] "1"  "1"  "0"  "1"  "1"  "1" 
[4,] "1"  "0"  "1"  "0"  "0"  "1" 
[5,] "1"  "0"  "1"  "0"  "0"  "1" 
[6,] "1"  "1"  "1"  "1"  "1"  "0"

I know I can achieve this by doing:

final.matrix <- matrix(nrow= length(individuals), ncol = length(individuals))
final.matrix[grep("apple", individuals),] <- 1
final.matrix[,grep("apple", individuals)] <- 1
diag(final.matrix) <- 0
final.matrix[is.na(final.matrix)] <- 0

But there's gotta be a cleaner/simpler way. What am I missing?

Additionally, this doesn't work when the row/column names are a tibble, which is how they are in reality. Suggestions for a solution that works with tibbles?

tibble_inds <- as_tibble(individuals)
grep("apple", tibble_inds)
# 1

CodePudding user response：

It sounds like you want

outer(x, x, function(a, b) as.integer(a + b == 1L))

where

x <- tibble_inds[[1L]] == "apple"

if you accept only "apple" or

x <- grepl("apple", tibble_inds[[1L]])

if you accept any string having "apple" as a substring.
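To make the difference between the two choices of x concrete, here is a minimal sketch. It assumes a base data.frame as a stand-in for the tibble (`[[1L]]` extracts the first column as a plain character vector in both cases), and "pineapple" is a hypothetical label added only to show where the two tests disagree.

```r
# Stand-in for the tibble: a one-column base data.frame (assumption).
# "pineapple" is a hypothetical label, not part of the original data.
inds <- data.frame(value = c("apple", "pear", "pineapple"),
                   stringsAsFactors = FALSE)

# Pull the column out as a plain character vector before matching.
labels <- inds[[1L]]

exact   <- labels == "apple"        # TRUE only for the exact string "apple"
partial <- grepl("apple", labels)   # TRUE for any label containing "apple"

print(exact)
print(partial)
```

With these labels, only the substring test flags "pineapple", so the choice matters whenever names can contain "apple" as part of a longer string.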

I am assuming that your character vector individuals is the first variable in tibble_inds. In this case, outer returns

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    1    0    1    1    0
## [2,]    1    0    1    0    0    1
## [3,]    0    1    0    1    1    0
## [4,]    1    0    1    0    0    1
## [5,]    1    0    1    0    0    1
## [6,]    0    1    0    1    1    0

for both choices of x. This result doesn't match yours, because your diag<- call zeroes only the diagonal and so misses the off-diagonal apple-to-apple cells [1,3], [3,1], [3,6], [6,3], [1,6], and [6,1].
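One way to check which cells differ is to build both matrices and compare them. This is only a sketch: it assumes the pattern in the question is meant to be "apple", and the names `is_apple`, `q`, `a`, and `mismatch` are made up for illustration.

```r
individuals <- c("apple", "pear", "apple", "cranberry", "peach", "apple")
is_apple <- individuals == "apple"
n <- length(individuals)

# The question's approach: mark apple rows and columns, then zero the diagonal.
q <- matrix(0, n, n)
q[is_apple, ] <- 1
q[, is_apple] <- 1
diag(q) <- 0

# The "exactly one apple" rule from this answer.
a <- outer(is_apple, is_apple, function(a, b) as.integer(a + b == 1L))

# Row/column positions where the two matrices disagree.
mismatch <- which(q != a, arr.ind = TRUE)
print(mismatch)
```

Every row of `mismatch` pairs two different apple positions, which is exactly the set of cells that zeroing the diagonal leaves at 1.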

CodePudding user response：

Another possible solution:

individuals <- c("apple", "pear", "apple", "cranberry", "peach", "apple")

m <- matrix(0, length(individuals), length(individuals))

for (i in 1:length(individuals))
  for (j in 1:length(individuals))
    m[i, j] <-  (sum(c(individuals[i], individuals[j]) == "apple") == 1)

m

#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    0    1    0    1    1    0
#> [2,]    1    0    1    0    0    1
#> [3,]    0    1    0    1    1    0
#> [4,]    1    0    1    0    0    1
#> [5,]    1    0    1    0    0    1
#> [6,]    0    1    0    1    1    0

Or replacing the nested for loop by a nested sapply:

sapply(seq_along(individuals), \(i) sapply(seq_along(individuals),
    \(j) as.integer(sum(c(individuals[i], individuals[j]) == "apple") == 1)))

#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    0    1    0    1    1    0
#> [2,]    1    0    1    0    0    1
#> [3,]    0    1    0    1    1    0
#> [4,]    1    0    1    0    0    1
#> [5,]    1    0    1    0    0    1
#> [6,]    0    1    0    1    1    0

CodePudding user response：

We can try outer like below

> x <- grepl("apple",individuals)

> +(outer(x, x, `+`) == 1)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    1    0    1    1    0
[2,]    1    0    1    0    0    1
[3,]    0    1    0    1    1    0
[4,]    1    0    1    0    0    1
[5,]    1    0    1    0    0    1
[6,]    0    1    0    1    1    0
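
A possible variation, not taken from the answers above: "exactly one of the two is an apple" is a logical XOR, so the same matrix can be sketched with `!=` as the function passed to outer. Setting dimnames is an optional extra that makes the cells easier to read.

```r
individuals <- c("apple", "pear", "apple", "cranberry", "peach", "apple")
x <- grepl("apple", individuals)

# x[i] != x[j] is TRUE exactly when one of the pair is an apple (XOR);
# the unary + converts the logical matrix to 0/1 integers.
m <- +outer(x, x, `!=`)

# Optional: label rows and columns with the original names.
dimnames(m) <- list(individuals, individuals)
print(m)
```

For a very large matrix this avoids the intermediate sum, but whether that matters in practice would need to be measured on the real data.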