Deriving cosine values for vector contrasts distributed over rows in a dataframe (rows to individual

Time:07-04

I am attempting to use the lsa::cosine function to derive cosine values between vectors distributed across successive rows of a dataframe. My raw dataframe has 15 numeric columns, and each row is a unique 15-item vector.

My challenge is to create a new variable (e.g., cosineraw) that reflects cosine(vec1, vec2). Vec1 is the vector for Row1 and Vec2 is the vector for the next row (lead). I need this function to loop over rows for very large dataframes and am attempting to avoid a for loop. Essentially I need to compute a cosine value for each row contrasted to the next row stopping at the second to last row of the dataframe (since there is no cosine value for the last observation).

I've tried selecting observations rowwise:

dat <- mydat %>% rowwise %>% mutate(cosraw = cosine(as.vector(t(select_all))), as.vector(t(lead(select_all))))

but am getting an 'argument is not a matrix' error

In isolation, this code snippet works: maybe <- lsa::cosine(as.vector(t(dat[2,])), as.vector(t(dat[1,])))

The problem is that the row index must be relative. The snippet above only works for row 1 vs. row 2; it cannot serve as the basis for a function rolling across all rows.

Is there a way to do this avoiding a 'for' loop?

CodePudding user response:

I'm assuming that you aren't looking for vectorization, as well (i.e., lapply or map).

This works, but it's a bit cumbersome. I didn't have any actual data from you so I made my own.

library(lsa)
library(tidyverse)

set.seed(1)

df1 <- matrix(sample(rnorm(15 * 11, 1, .1), 15 * 10), byrow = T, ncol = 15)

Then I created a copy of the data to use as the lead, because for the mutate to work, you need to lead columnwise, but aggregate rowwise. (That doesn't sound quite right, but hopefully, you can make heads or tails of it.)

df2 <- df1
df3 <- df2[-1, ]               # all but the first row
df3 <- rbind(df3, rep(NA, 15)) # fill the missing row with NA
df2 <- cbind(df2, df3) %>% as.data.frame()

So now I've got a data frame that is 30 columns wide. The first 15 columns are my vector; the second 15 are the lead.

df2 %>% 
  rowwise %>% 
  mutate(cosr = cosine(c_across(V1:V15), c_across(V16:V30))) %>% 
  select(cosr) %>% unlist()
#     cosr1     cosr2     cosr3     cosr4     cosr5     cosr6     cosr7     cosr8 
# 0.9869402 0.9881976 0.9932426 0.9921418 0.9946119 0.9917792 0.9908216 0.9918681 
#     cosr9    cosr10 
# 0.9972666        NA

If in doubt, you can always use a loop or vectorization to validate the numbers.

for(i in 1:(nrow(df1) - 1)) {
  v1 <- df1[i, ] %>% unlist()
  v2 <- df1[i + 1, ] %>% unlist()
  message(cosine(v1, v2))
}

invisible(
  lapply(1:(nrow(df1) - 1),
         function(i) {message(cosine(unlist(df1[i, ]), 
                                     unlist(df1[i + 1, ])))}))
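
As a further cross-check that avoids both the loop and rowwise(), the cosine of each row with the next one can be sketched directly with matrix arithmetic in base R. This is only a sketch under assumptions: `adjacent_cosine` is a hypothetical helper name, not part of lsa, and it relies on the standard definition of cosine similarity (dot product divided by the product of the norms), so its values should agree with lsa::cosine up to floating-point error.

```r
# Hypothetical helper (not from lsa): cosine between each row and the next row.
adjacent_cosine <- function(m) {
  m <- as.matrix(m)
  n <- nrow(m)
  if (n < 2) return(rep(NA_real_, n))
  a <- m[-n, , drop = FALSE]  # rows 1 .. n-1
  b <- m[-1, , drop = FALSE]  # rows 2 .. n (the "lead" rows)
  # Row-wise dot products divided by the product of the row norms
  cos_ab <- rowSums(a * b) / sqrt(rowSums(a^2) * rowSums(b^2))
  c(unname(cos_ab), NA_real_)  # the last row has no successor
}

# Made-up data with the same shape as the example (10 rows, 15 columns)
set.seed(1)
m <- matrix(rnorm(10 * 15, mean = 1, sd = 0.1), ncol = 15, byrow = TRUE)
vec <- adjacent_cosine(m)

# Explicit pairwise computation for comparison
loop <- vapply(seq_len(nrow(m) - 1), function(i) {
  sum(m[i, ] * m[i + 1, ]) / sqrt(sum(m[i, ]^2) * sum(m[i + 1, ]^2))
}, numeric(1))

all.equal(vec[-length(vec)], loop)
```

Because the whole computation is a handful of matrix operations, it should scale to large inputs better than a per-row call, at the cost of holding two shifted copies of the matrix in memory.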

CodePudding user response:

My answer is similar to Kat's, but I first packed the 15 values of each row into a list column and then created a second column holding the lead of that list column.

Here is some reproducible data:

library(dplyr) 
library(tidyr)
library(lsa)

set.seed(1)
df <- data.frame(replicate(15,runif(10))) 

The actual workflow:

df %>% 
    rowwise %>% 
    summarise(row_v = list(c_across()))  %>%
    mutate(nextrow_v = lead(row_v)) %>%
    replace_na(list(nextrow_v=list(rep(NA, 15)))) %>% # replace NA with a list of NAs
    rowwise %>%
    summarise(cosr = cosine(unlist(row_v), unlist(nextrow_v))) 

# A tibble: 10 x 1
# Rowwise: 
   cosr[,1]
     <dbl>
 1   0.820
 2   0.791
 3   0.780
 4   0.785
 5   0.838
 6   0.808
 7   0.718
 8   0.743
 9   0.773
10  NA
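
The `cosr[,1]` header in the output above suggests that the result is stored as a one-column matrix rather than a plain numeric vector. If the goal is an ordinary numeric column such as `cosineraw` attached to the original data frame, a minimal base R sketch could look like the following. The toy data frame and its column names are made up for illustration, and the cosine is computed from its standard definition rather than by calling lsa::cosine.

```r
# Toy data frame standing in for the real one (3 numeric columns for brevity)
dat <- data.frame(a = c(1, 0, 1, 2),
                  b = c(0, 1, 1, 2),
                  c = c(0, 0, 0, 0))

m   <- as.matrix(dat)                # take the numeric columns before adding the result
cur <- m[-nrow(m), , drop = FALSE]   # rows 1 .. n-1
nxt <- m[-1, , drop = FALSE]         # rows 2 .. n (the lead)

# Cosine of each row with the next row; NA for the last row
dat$cosineraw <- c(
  unname(rowSums(cur * nxt) / sqrt(rowSums(cur^2) * rowSums(nxt^2))),
  NA_real_
)

dat$cosineraw
# 0.0000000 0.7071068 1.0000000 NA
```

Row 1 and row 2 are orthogonal (cosine 0), row 3 and row 4 point in the same direction (cosine 1), and the final row gets NA because it has no next row.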