split() returns "longer object length is not a multiple of shorter object length"

Context

I asked this question recently:

Comparing partitions from split() using a nested for loop containing an if statement

where I needed to compare partitions generated by split() from a distance matrix using the code fix provided by @robertdj

set.seed(1234) # set random seed for reproducibility

# generate random normal variates
x <- rnorm(5)
y <- rnorm(5)

df <- data.frame(x, y) # merge vectors into dataframe
d <- dist(x) # generate distance matrix

splt <- split(d, 1:5) # split the 10 distances into 5 partitions of 2 values each


for (i in seq_along(splt)) {
    for (j in seq_along(splt)) {
        if (i != j) {
            a <- length(which(splt[[i]] >= min(splt[[j]]))) / length(splt[[i]])
            b <- length(which(splt[[j]] <= max(splt[[i]]))) / length(splt[[j]])
        }
    }
}

I generated a MWE where each split contained the same number of elements. I did this just for illustrative purposes, fully knowing that this would not necessarily hold for real data.

As per @Robert Hacken's comment if I instead do

d <- na.omit(d[lower.tri(d)])

I get partitions of unequal length.

Real Data

However, my real data does not have the "same size" property, and it contains many more partitions than the 5 in my MWE.

Here is my code

splt <- split(dist_matrix, sub("(?:(.*)\\|){2}(\\w+)\\|(\\w+)\\|.*?$", "\\1-\\2", colnames(dist_matrix)))

The distance matrix dist_matrix contains FASTA headers from which I extract the species names.
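
As a rough illustration of pulling species labels out of pipe-delimited FASTA headers, the sketch below uses strsplit() instead of a regular expression. The header layout (ID|marker|Genus|species|extra) and the sample headers are assumptions invented for this example; real headers may be laid out differently.

```r
# Hypothetical headers; the field order is an assumption, not taken from the real data
headers <- c("ABC123|COI|Homo|sapiens|x",
             "DEF456|COI|Pan|troglodytes|y",
             "GHI789|COI|Homo|sapiens|z")

# Split each header on the literal "|" character
fields <- strsplit(headers, "|", fixed = TRUE)

# Join the 3rd and 4th fields (genus and species) into one label per header
species <- vapply(fields, function(f) paste(f[3], f[4], sep = "-"), character(1))
species
```

A vector like species, with one label per column of the distance matrix, is the kind of grouping factor that split() expects.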

I then use splt above in the doubly nested loop.

For instance, splt[[4]] contains 5 values, whereas splt[[10]] contains 9.

splt[[4]]
[1] 0.1316667 0.1383333 0.1166667 0.1333333 0.1216667

splt[[10]]
 [1] 0.1450000 0.1483333 0.1316667 0.1316667 0.1333333 0.1333333 0.1166667 0.1166667 0.1200000

Expected Output

For my real problem, each partition corresponds to the distances from a single species to every other species. So, if Species X is represented by two DNA sequences and there are 10 species in total, the partition for Species X should contain 20 distances. However, I don't want the partition to include the distance between the two sequences for Species X.

splt would thus contain 10 partitions (each not necessarily of the same length) for all species
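
One way partitions like this could be built is sketched below. It assumes a square distance matrix plus a vector of species labels, one per row and column; the labels, the random coordinates, and the name between are all made up for illustration. For each species it keeps the rows belonging to that species and only the columns belonging to other species, so within-species distances (and the zero diagonal) are left out.

```r
set.seed(1)

# Hypothetical labels: species A has 2 sequences, B has 1, C has 3
species <- c("A", "A", "B", "C", "C", "C")

# Stand-in data: random 2-D points converted to a full distance matrix
coords <- matrix(rnorm(length(species) * 2), ncol = 2)
dist_matrix <- as.matrix(dist(coords))
dimnames(dist_matrix) <- list(species, species)

# For each species, keep its rows and drop its own columns
between <- lapply(setNames(nm = unique(species)), function(sp) {
  own <- species == sp
  as.vector(dist_matrix[own, !own, drop = FALSE])
})

lengths(between)
```

With these labels the partitions have lengths 8, 5 and 9 (rows of the species times columns of all other sequences), so unequal lengths are expected rather than a sign of a problem.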

The expected output of a and b is a number between 0 and 1 inclusive. I think these numbers should be small in my real example, but they are large when I run my code, which I think is a consequence of the warning.

What I've Done

I've read on SO that %in% is typically used to resolve the warning

In splt[[i]] == splt[[j]] :
  longer object length is not a multiple of shorter object length

except in my case, I believe I would need `%notin%` <- Negate(`%in%`).

However, %notin% gives the error in my original post

the condition has length > 1
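
To make the two messages concrete, here is a minimal sketch using made-up vectors of lengths 5 and 9 (not the real data). It shows that == recycles the shorter vector and warns when the lengths do not divide evenly, that comparing against a single number such as min() does not, and that %in% returns one logical per element of its left-hand side, which has to be collapsed with any() or all() before it can be used in if().

```r
# Made-up partitions of unequal length
a <- c(0.13, 0.14, 0.12, 0.13, 0.12)
b <- c(0.15, 0.15, 0.13, 0.13, 0.13, 0.13, 0.12, 0.12, 0.12)

# Element-wise == recycles `a` along `b`; 9 is not a multiple of 5, so R warns
msg <- tryCatch(a == b, warning = function(w) conditionMessage(w))
msg

# Comparing a vector with a single number raises no warning
no_warning <- tryCatch({ a >= min(b); TRUE }, warning = function(w) FALSE)

# %in% yields one TRUE/FALSE per element of `a`, whatever the length of `b`
in_b <- a %in% b
length(in_b)

# if() needs a single TRUE/FALSE, so collapse the vector first
verdict <- if (any(a %in% b)) "some shared values" else "no shared values"
verdict
```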

Question

How can my nested loop be altered to remove the warning?

Answer

I'm going to go out on a limb by interpreting parts of what you say, discarding your code, and seeing what I can come up with. If nothing else, it may spark a conversation about which of my interpretations are correct (and which are not).

Starting with the splt as generated by the random data, then replacing elements 4 and 5 with longer vectors,

set.seed(1234)
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y)
d <- dist(x)
splt <- split(d, 1:5)
splt[[4]] <- rnorm(4)
splt[[5]] <- rnorm(10)

We have:

splt <- list("1" = c(1.48449499149608, 2.62312694474001), "2" = c(2.29150692606848, 0.15169544670039), "3" = c(1.13863195324393, 3.43013887931241), "4" = c(-0.477192699753547, -0.998386444859704, -0.77625389463799, 0.0644588172762693), "5" = c(-0.693720246937475, -1.44820491038647, 0.574755720900728, -1.02365572296388, -0.0151383003641817, -0.935948601168394, 1.10229754620026, -0.475593078869057, -0.709440037512506, -0.501258060594761))
splt
# $`1`
# [1] 1.484495 2.623127
# $`2`
# [1] 2.2915069 0.1516954
# $`3`
# [1] 1.138632 3.430139
# $`4`
# [1] -0.47719270 -0.99838644 -0.77625389  0.06445882
# $`5`
#  [1] -0.6937202 -1.4482049  0.5747557 -1.0236557 -0.0151383 -0.9359486  1.1022975 -0.4755931 -0.7094400 -0.5012581

You reference expressions like which(splt[[i]] >= min(splt[[j]])), which I'm interpreting to mean "what fraction of splt[[i]] is above the max value in splt[[j]]?" Since we're comparing (for example) splt[[1]] with each of splt[[2]] through splt[[5]] here, and likewise for the others, we're going to end up with a square matrix whose diagonal is splt[[i]]-vs-splt[[i]] (likely not interesting).

Some quick math so we know what we should end up with:

splt[[1]]
# [1] 1.484495 2.623127
range(splt[[2]])
# [1] 0.1516954 2.2915069

Since one of the two values in [[1]] is greater than [[2]]'s max of 2.29, we expect 0.5 in a comparison between the two (for >= max(.)); similarly, none of [[1]] is below [[2]]'s min of 0.15, so we expect a 0 there.

Similarly, [[5]] over [[4]]:

splt[[5]]
#  [1] -0.6937202 -1.4482049  0.5747557 -1.0236557 -0.0151383 -0.9359486  1.1022975 -0.4755931 -0.7094400 -0.5012581
range(splt[[4]])
# [1] -0.99838644  0.06445882

### 2 of 10 are greater than the max
sum(splt[[5]] >= max(splt[[4]])) / length(splt[[5]])
# [1] 0.2

### 2 of 10 are less than or equal to the min
sum(splt[[5]] <= min(splt[[4]])) / length(splt[[5]])
# [1] 0.2

We can use outer, but sometimes that can be confusing, especially since in this case we'd need to Vectorize the anon-func passed to it. I'll adapt your double-for loop premise into nested sapply calls.

Greater than the other's max

sapply(splt, function(y) sapply(setNames(splt, paste0("max", seq_along(splt))), function(z) sum(y >= max(z)) / length(y)))
#        1   2   3    4   5
# max1 0.5 0.0 0.5 0.00 0.0
# max2 0.5 0.5 0.5 0.00 0.0
# max3 0.0 0.0 0.5 0.00 0.0
# max4 1.0 1.0 1.0 0.25 0.2
# max5 1.0 0.5 1.0 0.00 0.1

Interpretation and subset validation:

1 with max of 2: comparing [[1]] (first column) with the max value from [[2]] (second row), half of 1's values are greater, so we have 0.5 (as expected).
5 with max of 4: comparing [[5]] (fifth column) with the max value from [[4]] (fourth row), 0.2 meet the condition.

Less than the other's min

sapply(splt, function(y) sapply(setNames(splt, paste0("min", seq_along(splt))), function(z) sum(y <= min(z)) / length(y)))
#        1   2   3    4   5
# min1 0.5 0.5 0.5 1.00 1.0
# min2 0.0 0.5 0.0 1.00 0.8
# min3 0.0 0.5 0.5 1.00 1.0
# min4 0.0 0.0 0.0 0.25 0.2
# min5 0.0 0.0 0.0 0.00 0.1

Same two pairs:

1 with min of 2 (row 2, column 1) is 0, as expected
5 with min of 4 (row 4, column 5) is 0.2, as expected
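
For comparison, the outer() alternative mentioned earlier could be sketched roughly as follows. The list parts and the helper frac_ge_max are hypothetical names with small made-up values, not the splt from above. outer() passes whole vectors of indices to its function, which is why the helper is wrapped in Vectorize().

```r
# Small made-up partitions of unequal length
parts <- list(a = c(1.5, 2.6),
              b = c(2.3, 0.2),
              c = c(-0.5, -1.0, -0.8, 0.1))

# Fraction of partition i that is >= the max of partition j
frac_ge_max <- function(i, j) mean(parts[[i]] >= max(parts[[j]]))

idx <- seq_along(parts)
res <- outer(idx, idx, Vectorize(frac_ge_max))
dimnames(res) <- list(names(parts), paste0("max_", names(parts)))
res
```

Note the orientation: here rows are the partition being tested and columns are the partition supplying the max, which is the transpose of the nested sapply layout.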