I am trying to replicate this excel formula:
Excel
=IF(AND(var1[i 1]=var1[i],var2[i 1]=var2[i],var3[i 1]=var3[i]),0,1)
Basically I need to identify if row[i 1] is equal to row[i] using three (3) variables: "periodo_ocurr", "tipo_riesgo" and "paciente". If row[i 1] is equal to row[i] then 0 else 1
The code in R:
for(i in 1 : nrow(data_ocurr_2019))
{data_ocurr_2019$Flag_Pac[i 1] <- ifelse(data_ocurr_2019$periodo_ocurr[i 1] == data_ocurr_2019$periodo_ocurr[i] &
data_ocurr_2019$tipo_riesgo[i 1] == data_ocurr_2019$tipo_riesgo[i] &
data_ocurr_2019$paciente[i 1] == data_ocurr_2019$paciente[i], 0, 1)}
What is the issue? my data is over 1 million records and it takes a long time to execute it
Is there another way to do it? more efficient?
Any help will be appreciated
CodePudding user response:
Try:
data_ocurr_2019$Flag_Pac = ifelse(lead(data_ocurr_2019$periodo_ocurr) == data_ocurr_2019$periodo_ocurr &
lead(data_ocurr_2019$tipo_riesgo) == data_ocurr_2019$tipo_riesgo &
lead(data_ocurr_2019$paciente) == data_ocurr_2019$paciente, 0, 1)
This should be much faster, since it relies on a vectorized calculation, so R can compile the code once and apply it to all rows at once.
More explanation of vectorization in R: https://www.noamross.net/archives/2014-04-16-vectorization-in-r-why/
CodePudding user response:
You do not need ifelse
or a loop. First provide reproducible data using dput
for testing:
test <- structure(list(Var1 = c(5.1, 4.7, 4.6, 4.6, 4.6, 5, 5, 5, 4.6,
4.4, 4.9, 4.8, 4.3, 5.8, 5.8, 5.4, 5.4, 5.1, 5.1, 5.1), Var2 = c(3.5,
3.2, 3.1, 3.1, 3.1, 3.6, 3.6, 3.6, 3.4, 2.9, 3.1, 3, 3, 4, 4,
3.9, 3.9, 3.5, 3.5, 3.8), Var3 = c(1.4, 1.3, 1.5, 1.5, 1.5, 1.4,
1.4, 1.4, 1.4, 1.4, 1.5, 1.4, 1.1, 1.2, 1.2, 1.3, 1.3, 1.4, 1.4,
1.5)), row.names = c(NA, -20L), class = "data.frame")
Now compare rows:
nrow <- dim(test)[1]
ncol <- dim(test)[2]
match <- rowSums(test[1:(nrow-1), ] == test[2:nrow, ]) == ncol
idx <- which(match) 1
idx
# 3 4 6 7 14 16 18
# 4 5 7 8 15 17 19
We compare adjacent rows and then sum over the row to see if all three columns match. Then then we increment the row number assuming you want to identify the duplicate row, not the preceding row. You won't be able to use the next lines because of the size of your data, but it illustrates the results with this example:
flag <- rep(" ", nrow)
flag[idx] <- "<---"
cbind(test, flag)
# Var1 Var2 Var3 flag
# 1 5.1 3.5 1.4
# 2 4.7 3.2 1.3
# 3 4.6 3.1 1.5
# 4 4.6 3.1 1.5 <---
# 5 4.6 3.1 1.5 <---
# 6 5.0 3.6 1.4
# 7 5.0 3.6 1.4 <---
# 8 5.0 3.6 1.4 <---
# 9 4.6 3.4 1.4
# 10 4.4 2.9 1.4
# 11 4.9 3.1 1.5
# 12 4.8 3.0 1.4
# 13 4.3 3.0 1.1
# 14 5.8 4.0 1.2
# 15 5.8 4.0 1.2 <---
# 16 5.4 3.9 1.3
# 17 5.4 3.9 1.3 <---
# 18 5.1 3.5 1.4
# 19 5.1 3.5 1.4 <---
# 20 5.1 3.8 1.5