I am attempting to use the lsa::cosine function to compute cosine values between vectors stored in successive rows of a dataframe. My raw dataframe has 15 numeric columns, and each row is a unique 15-item vector.
My challenge is to create a new variable (e.g., cosineraw) that reflects cosine(vec1, vec2), where vec1 is the vector for the current row and vec2 is the vector for the next row (lead). I need this to run over the rows of very large dataframes and am trying to avoid a for loop. Essentially, I need a cosine value for each row compared with the next row, stopping at the second-to-last row of the dataframe (since there is no cosine value for the last observation).
I've tried selecting observations rowwise:
dat <- mydat %>% rowwise %>% mutate(cosraw = cosine(as.vector(t(select_all))), as.vector(t(lead(select_all))))
but I am getting an 'argument is not a matrix' error.
In isolation, this code snippet works:
maybe <- lsa::cosine(as.vector(t(dat[2,])), as.vector(t(dat[1,])))
The problem is that the row indices here are hard-coded, whereas they need to be relative. This works for row 1 vs. row 2, but not as the basis for a function rolling across all rows.
Is there a way to do this avoiding a 'for' loop?
CodePudding user response:
I'm assuming that you aren't looking for lapply- or map-style iteration either.
This works, but it's a bit cumbersome. I didn't have any actual data from you so I made my own.
library(lsa)
library(tidyverse)
set.seed(1)
df1 <- matrix(sample(rnorm(15 * 11, 1, .1), 15 * 10), byrow = T, ncol = 15)
Then I created a copy of the data to use as the lead, because for the mutate to work, you need to lead columnwise, but aggregate rowwise. (That doesn't sound quite right, but hopefully, you can make heads or tails of it.)
df2 <- df1
df3 <- df2[-1, ] # all but the first row
df3 <- rbind(df3, rep(NA, 15)) # fill the missing row with NA
df2 <- cbind(df2, df3) %>% as.data.frame()
So now I've got a data frame that is 30 columns wide. The first 15 columns are my vector; the second 15 are the lead.
df2 %>%
rowwise %>%
mutate(cosr = cosine(c_across(V1:V15), c_across(V16:V30))) %>%
select(cosr) %>% unlist()
# cosr1 cosr2 cosr3 cosr4 cosr5 cosr6 cosr7 cosr8
# 0.9869402 0.9881976 0.9932426 0.9921418 0.9946119 0.9917792 0.9908216 0.9918681
# cosr9 cosr10
# 0.9972666 NA
If in doubt, you can always use a loop or vectorization to validate the numbers.
for(i in 1:(nrow(df1) - 1)) {
  v1 <- df1[i, ] %>% unlist()      # current row
  v2 <- df1[i + 1, ] %>% unlist()  # next row
  message(cosine(v1, v2))
}
invisible(
  lapply(1:(nrow(df1) - 1),
         function(i) {message(cosine(unlist(df1[i, ]),
                                     unlist(df1[i + 1, ])))}))
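If the goal is mainly to avoid row-by-row iteration on a very large data frame, the same quantity can be computed with plain matrix arithmetic instead of lsa::cosine. The sketch below is a swapped-in technique, not what the code above does: it applies the standard cosine formula, sum(a * b) / sqrt(sum(a^2) * sum(b^2)), to every row and its successor at once. The helper name cosine_next_row and the toy data are hypothetical, and it assumes every column passed in is numeric.

```r
# Hypothetical helper (not part of lsa): cosine of each row with the next row,
# computed for all rows at once using base R matrix arithmetic.
cosine_next_row <- function(m) {
  m <- as.matrix(m)                     # assumes all columns are numeric
  n <- nrow(m)
  if (n < 2) return(rep(NA_real_, n))   # nothing to compare against
  cur <- m[-n, , drop = FALSE]          # rows 1 .. n-1
  nxt <- m[-1, , drop = FALSE]          # rows 2 .. n (the "lead")
  cos_vals <- rowSums(cur * nxt) / sqrt(rowSums(cur^2) * rowSums(nxt^2))
  c(unname(cos_vals), NA_real_)         # the last row has no successor
}

# Toy data for illustration only: 10 rows, 15 numeric columns
set.seed(1)
toy <- data.frame(replicate(15, runif(10)))
toy$cosineraw <- cosine_next_row(toy)
```

Because the result has one value per row with a trailing NA, it can be assigned straight to a new column. It is worth spot-checking a few rows against lsa::cosine before relying on it.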
CodePudding user response:
My answer is similar to Kat's, but I first packaged the 15 values of each row into a list and then created a new list column holding the next row's values (the lead).
Here is some reproducible data:
library(dplyr)
library(tidyr)
library(lsa)
set.seed(1)
df <- data.frame(replicate(15,runif(10)))
The actual workflow:
df %>%
rowwise %>%
summarise(row_v = list(c_across())) %>%
mutate(nextrow_v = lead(row_v)) %>%
replace_na(list(nextrow_v=list(rep(NA, 15)))) %>% # replace NA with a list of NAs
rowwise %>%
summarise(cosr = cosine(unlist(row_v), unlist(nextrow_v)))
# A tibble: 10 x 1
# Rowwise:
cosr[,1]
<dbl>
1 0.820
2 0.791
3 0.780
4 0.785
5 0.838
6 0.808
7 0.718
8 0.743
9 0.773
10 NA
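The cosr[,1] header in the output above suggests that each cosine comes back as a 1 x 1 matrix rather than a plain number, so the new column is a matrix column. The sketch below illustrates why, assuming the two-vector case is computed with crossprod() (which I believe is what lsa::cosine does). The function cos_pair is a hypothetical stand-in written for this example, not the library function.

```r
# Hypothetical stand-in for the two-vector case; crossprod() returns a 1 x 1 matrix
cos_pair <- function(x, y) {
  crossprod(x, y) / sqrt(crossprod(x) * crossprod(y))
}

x <- c(1, 2, 3)
y <- c(2, 4, 6)            # parallel to x, so the cosine is 1

res <- cos_pair(x, y)
is.matrix(res)             # TRUE: the result still carries dimensions
dim(res)                   # 1 1

flat <- as.numeric(res)    # drop the dimensions to get a plain double
is.matrix(flat)            # FALSE
```

Under that assumption, wrapping the call in as.numeric() inside summarise() should give an ordinary <dbl> column named cosr.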