R: More efficient code for creating and manipulating variables based on other variables and another-CodePudding

I'm trying to tidy up my code. At the moment I have several chunks with 25 lines of the same code, individually changing each variable. I'm wondering whether there is a way to apply all of this to the relevant variables without having it all written out for each one individually.

I'm working with two tables:

data (200x115) - my dataset. The changes I'm making here apply to every other column in starting from the 19th column, up until the 67th, all of which are chr variables (we'll call these a b c ... y - but the actual names are random words with no pattern to them).

match.words (92393x1) - list of words. I'm using this to identify words in my dataset that match or don't match words in this list.

I started by creating new variables (a.match, b.match... y.match) to indicate whether each case in a b [...] y matches a word in match.words.

data$a.match <- ifelse(data$a %in% match.words$. == TRUE, 0, 1)
data$b.match <- ifelse(data$b %in% match.words$. == TRUE, 0, 1)
...
data$y.match <- ifelse(data$y %in% match.words$. == TRUE, 0, 1)

I then go through and check each of these to see which cases don't match - this code gives me 25 tables which is fine, but I'm wondering whether it's possible to obtain either all of these or a combined version with more concise code.

data %>% filter(a.match != 0) %>% select(ID, subject, a)
data %>% filter(b.match != 0) %>% select(ID, subject, b)
...
data %>% filter(y.match != 0) %>% select(ID, subject, y)

I then replace the contents of a b [...] y with NA where the content of the new variables, a.match b.match [...] y.match, is 1.

data$a[data$a.match == 1] = NA
data$b[data$b.match == 1] = NA
...
data$y[data$y.match == 1] = NA

I'm assuming that this is something you might use a loop for, but I've had some trouble getting my head around how these work.

EDIT

The aim of this is just to check (amend if necessary) and replace cases (across those 25 variables) that don't match cases in match.words, with NA.

The first chunk is identifying cases that don't match, the second is checking those unmatched cases in (e.g., to fix spelling mistakes - in this case I would then go back and re-run the first chunk), and then replacing unmatched cases with NA, based on the coding of the new variables, which are based on matches between original variables and the word list. Hope that makes sense!

Example data:

data <- structure(list(Var1 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
Var2 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
Var3 = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
Var3.5 = c(100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L), 
Var4 = c(1353L, 1398L, 2168L, 1354L, 1966L, 2396L, 3367L, 2955L, 1568L, 1037L), 
Var5 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
Var6 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), ID = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yza", "bcd"), 
Var8 = c("anonymous", "anonymous", "anonymous", "anonymous", "anonymous", "anonymous", "anonymous", "anonymous", "anonymous", "anonymous"), 
Var9 = c("EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN", "EN"), 
Var10 = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), 
Var11 = c(1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L), 
Var12 = c(24L, 24L, 37L, 31L, 22L, 25L, 26L, 23L, 22L, 21L), 
Var13 = c(9L, 9L, 9L, 13L, 13L, 15L, 9L, 9L, 13L, 9L), 
Var14 = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L), Var15 = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
Var16 = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), 
Var17 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
a = c("money", "clock", "now", "day", "find", "2", "clock", "parents", "stop", "unlimited"), 
a_RT = c(5.577, 5.762, 6.582, 7.255, 5.968, 4.755, 4.948, 3.092, 2.948, 10.262), 
b = c("mom", "people", "me", "tired", "anxiety", "home", "people", "man", "happy", "being"), 
b_RT = c(5.338, 3.372, 5.396, 4.55, 6.142, 5.484, 5.093, 0.385, 2.835, 3.102), 
c = c("date", "lost", "right", "2021", "birthday", "time", "covid", "old", "sad", "fear"), 
c_RT = c(3.535, 3.22, 8.571, 4.247, 9.373, 4.407, 4.293, 2.794, 3.168, 3.466), 
d = c("the", "forward", "that", "straight", "road", "through", "journey", "right", "move", "path"), 
d_RT = c(5.345, 3.865, 6.72, 6.401, 7.844, 4.862, 11.563, 2.876, 3.811, 4.122), 
e = c("sun", "home", "today", "friday", "night", "hour", "time", "bed", "friends", "monday"), 
e_RT = c(3.451, 7.779, 6.597, 6.2, 3.868, 4.173, 4.324, 2.9, 2.352, 3.298), 
f = c("of", "cat", "nothing", "car", "something", "ting", "toy", "object", "about", "being"), 
f_RT = c(4.685, 4.307, 9.874, 4.379, 9.93, 4.318, 4.365, 4.348, 3.019, 2.536), 
g = c("woman", "country", "elephant", "culture", "travel", "laptop", "kids", "ghost", "human", "him"), 
g_RT = c(5.819, 3.067, 5.178, 7.951, 5.129, 4.018, 4.31, 2.814, 3.212, 3.748), 
h = c("grass", "wide", "earth", "swim", "big", "sphere", "peace", "globe", "tide", "round"), 
h_RT = c(4.723, 3.183, 4.611, 6.801, 12.194, 4.595, 4.348, 2.446, 2.531, 12.612), 
i = c("enjoy", "death", "long", "love", "grave", "plant", "struggle", "family", "flower", "limited"), 
i_RT = c(4.41, 3.771, 4.865, 4.648, 7.344, 4.813, 4.516, 2.488, 4.426, 9.702), 
j = c("right", "health", "mine", "innovation", "ability", "finger", "feet", "foot", "stem", "sport"), 
j_RT = c(6.442, 4.917, 4.909, 17.376, 5.744, 4.522, 3.973, 2.295, 5.181, 4.411), 
k = c("of", "pizza", "car", "all", "whole", "way", "two", "body", "finger", "bit"), 
k_RT = c(3.724, 4.226, 10.101, 5.86, 5.507, 3.791, 4.947, 2.367, 5.058, 3.35), 
l = c("needs", "adult", "sport", "parents", "mother", "dog", "dodge", "car", "boy", "baby"), 
l_RT = c(4.376, 4.442, 9.954, 5.935, 6.02, 4.627, 4.962, 3.366, 5.075, 2.981), 
m = c("glass", "head", "ball", "sight", "ring", "strain", "toe", "dad", "cornea", "lashes"), 
m_RT = c(3.66, 3.426, 3.839, 4.062, 12.948, 4.545, 3.591, 2.641, 9.715, 3.571), 
n = c("man", "beard", "friend", "kindness", "girl", "hair", "suit", "gun", "hair", "her"), 
n_RT = c(6.252, 2.873, 3.712, 6.566, 9.654, 8.299, 3.913, 1.992, 2.412, 2.82), 
o = c("thing", "location", "castle", "home", "memory", "restaurant", "beach", "london", "turkey", "path"), 
o_RT = c(4.429, 3.999, 7.048, 3.016, 3.681, 8.167, 6.264, 3.077, 3.38, 3.535), 
p = c("job", "school", "city", "hospital", "money", "time", "home", "work", "office", "watch"), 
p_RT = c(3.309, 2.984, 5.778, 11.584, 4.806, 4.224, 4.269, 2.494, 2.824, 2.982), 
q = c("picnic", "weekend", "date", "work", "day", "sun", "play", "meal", "calendar", "long"), 
q_RT = c(3.58, 3.267, 9.026, 7.28, 6.126, 3.755, 3.394, 2.302, 6.186, 3.836), 
r = c("home", "study", "sensitive", "management", "lawyer", "briefcase", NA, NA, "hammer", "bag"), 
r_RT = c(3.766, 2.949, 6.413, 13.084, 10.924, 5.315, NA, NA, 3.943, 2.676), 
s = c("job", "teacher", "college", "study", "insomnia", "work", "money", "occupation", "lanyard", "study"), 
s_RT = c(4.889, 3.19, 5.68, 3.754, 12.803, 3.843, 5.602, 5.452, 5.589, 4.032), 
t = c("fool", "politics", "bad", "build", "country", "people", "conservative", "shame", "lost", "group"), 
t_RT = c(3.969, 3.866, 5.816, 3.901, 5.63, 4.783, 6.768, 2.29, 5.768, 7.114), 
u = c("job", "workforce", "amazon", "work", "friend", "time", "money", "corporation", "suit", "economy"), 
u_RT = c(3.05, 4.575, 11.664, 4.17, 4.261, 5.422, 5.02, 3.615, 4.766, 6.473), 
v = c("one", "pi", "3", "1", "phone", "some", "door", "phone", "nine", "seven"), 
v_RT = c(2.716, 3.974, 3.07, 3.375, 7.35, 3.289, 4.326, 2.493, 4.793, 4.053), 
w = c("therapy", "project", "students", "share", "support", "help", "project", "people", "friends", "alone"), 
w_RT = c(3.35, 4.309, 10.611, 7.638, 5.998, 5.798, 4.644, 3.348, 2.785, 2.52), 
x = c("solve", "finished", "broken", "fix", "books", "situation", "solve", "solution", "math", "annoying"), 
x_RT = c(3.425, 2.988, 5.617, 4.055, 8.834, 4.357, 4.386, 3.241, 2.615, 3.92), 
y = c("factoid", "library", "fiction", "life", "assumption", "true", "finding", "knowledge", "information", "firm"), 
y_RT = c(6.431, 4.637, 4.734, 8.892, 8.086, 4.415, 4.486, 4.987, 2.919, 3.133), 
Var18 = c(2L, 1L, 1L, 0L, 0L, 3L, 0L, 2L, 0L, 2L), 
Var19 = c(3L, 3L, 0L, 0L, 3L, 2L, 2L, 2L, 3L, 2L), 
Var20 = c(3L, 1L, 3L, 0L, 1L, 3L, 0L, 2L, 2L, 2L), 
Var21 = c(1L, 2L, 0L, 0L, 0L, 0L, 0L, 3L, 1L, 0L), 
Var22 = c(1L, 0L, 1L, 0L, 0L, 2L, 0L, 3L, 3L, 1L), 
Var23 = c(2L, 2L, 3L, 0L, 3L, 0L, 2L, 2L, 3L, 2L), 
Var24 = c(2L, 0L, 3L, 0L, 1L, 2L, 0L, 2L, 1L, 1L), 
Var25 = c(1L, 0L, 0L, 0L, 3L, 0L, 0L, 2L, 3L, 2L), 
Var26 = c(2L, 1L, 2L, 0L, 3L, 0L, 0L, 3L, 2L, 2L), 
Var27 = c(2L, 1L, 0L, 0L, 0L, 0L, 2L, 2L, 3L, 2L), 
Var28 = c(2L, 1L, 1L, 0L, 3L, 3L, 2L, 3L, 3L, 2L), 
Var29 = c(0L, 2L, 3L, 0L, 2L, 0L, 3L, 1L, 3L, 1L), 
Var30 = c(2L, 0L, 3L, 0L, 0L, 1L, 0L, 2L, 3L, 2L), 
Var31 = c(1L, 0L, 3L, 0L, 1L, 0L, 2L, 1L, 3L, 0L), 
Var32 = c(3L, 2L, 3L, 0L, 1L, 3L, 2L, 2L, 3L, 3L), 
Var33 = c(1L, 0L, 0L, 0L, 3L, 1L, 1L, 1L, 1L, 1L), 
Var34 = c(2L, 1L, 3L, 0L, 2L, 3L, 0L, 2L, 1L, 2L), 
Var35 = c(1L, 0L, 3L, 0L, 0L, 0L, 2L, 1L, 0L, 1L), 
Var36 = c(1L, 1L, 2L, 1L, 3L, 0L, 3L, 1L, 1L, 2L), 
Var37 = c(0L, 1L, 1L, 1L, 3L, 0L, 2L, 1L, 1L, 2L), 
Var38 = c(0L, 1L, 2L, 1L, 3L, 0L, 0L, 1L, 1L, 2L), 
Var39 = c(1L, 1L, 3L, 0L, 3L, 0L, 1L, 0L, 3L, 2L), 
Var40 = c(1L, 2L, 0L, 0L, 3L, 0L, 0L, 1L, 3L, 2L), 
Var41 = c(1L, 2L, 1L, 1L, 3L, 0L, 1L, 1L, 1L, 2L), 
Var42 = c(0L, 0L, 2L, 0L, 3L, 0L, 0L, 1L, 1L, 2L), 
Var43 = c(2L, 1L, 1L, 0L, 3L, 2L, 0L, 0L, 1L, 2L), 
Var44 = c(2L, 2L, 2L, 0L, 3L, 3L, 1L, 2L, 3L, 2L), 
Var45 = c(3L, 2L, 3L, 0L, 3L, 0L, 0L, 2L, 3L, 2L), 
Var46 = c(2L, 1L, 3L, 0L, 3L, 0L, 1L, 1L, 3L, 2L), 
Var47 = c(2L, 2L, 1L, 0L, 3L, 2L, 1L, 2L, 3L, 2L), 
Var48 = c(2L, 2L, 3L, 0L, 3L, 2L, 1L, 2L, 3L, 2L), 
Var49 = c(2L, 1L, 2L, 0L, 1L, 2L, 2L, 2L, 3L, 2L), 
Var50 = c(2L, 1L, 1L, 0L, 2L, 3L, 2L, 1L, 3L, 3L), 
Var51 = c(2L, 1L, 1L, 0L, 2L, 0L, 1L, 2L, 3L, 3L), 
Var52 = c(1L, 0L, 2L, 0L, 3L, 2L, 1L, 1L, 3L, 2L), 
Var53 = c(3L, 2L, 2L, 0L, 2L, 1L, 1L, 2L, 3L, 2L), 
Var54 = c(2L, 1L, 1L, 0L, 3L, 0L, 3L, 1L, 3L, 2L), 
Var55 = c(3L, 5L, 2L, 7L, 7L, 7L, 2L, 5L, 5L, 5L), 
Var56 = c(5L, 5L, 4L, 7L, 7L, 7L, 3L, 5L, 6L, 7L), 
Var57 = c(5L, 3L, 3L, 6L, 5L, 7L, 4L, 5L, 7L, 6L), 
Var58 = c(5L, 5L, 5L, 5L, 7L, 7L, 7L, 5L, 4L, 7L), 
Var59 = c(5L, 3L, 6L, 3L, 7L, 7L, 3L, 5L, 5L, 6L), 
Var60 = c(5L, 4L, 6L, 4L, 2L, 5L, 4L, 5L, 7L, 6L), 
Var61 = c(4L, 3L, 6L, 2L, 2L, 7L, 3L, 3L, 7L, 6L), 
Var62 = c(5L, 3L, 6L, 3L, 7L, 7L, 4L, 4L, 7L, 7L), 
subject = c(123L, 456L, 789L, 124L, 345L, 678L, 901L, 234L, 567L, 890L), 
Var64 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), 
row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

match.words <- structure(list(. = c("night", "money", "kids", "friends", "about", 
"annoying", "amazon", "something", "london", "weekend", "glass", 
"travel", "math", "1", "work", "corporation", "him", "whole", 
"friend", "stop", "lawyer", "love", "forward", "share", "some", 
"fix", "finger", "that", "innovation", "firm", "one", "wide", 
"journey", "dog", "meal", "watch", "time", "road", "job", "knowledge", 
"tide", "shame", "foot", "unlimited", "of", "workforce", "body", 
"pi", "swim", "covid", "college", "suit", "death", "right", "her", 
"sensitive", "fiction", "pizza", "straight", "occupation", "flower", 
"build", "mine", NA, "lashes", "grass", "play", "mom", "3", "hair", 
"elephant", "limited", "turkey", "laptop", "grave", "ball", "people", 
"earth", "through", "nine", "today", "group", "conservative", 
"bed", "assumption", "cornea", "me", "factoid", "cat", "big", 
"alone", "briefcase", "health", "management", "hospital", "help", 
"castle", "sight", "hammer", "toe")), row.names = c(NA, -100L
), class = "data.frame")

CodePudding user response：

I hope I understood it well that your final goal is just keep your data "as is" with just the words if they do not match up with your words list to be replaced with NA. I used data.table here instead of tidyverse. You can all do in one step, where you apply over each column you expect words and this are the odd column indexes here. Then you just check for a match and if so, leave the word and else replace it with NA.

library(data.table)
setDT(data) # make it a data.table

col_index <- 19:67 # define your first and last one to check
col_index <- col_index[col_index %% 2 == 1] # use Modulus and pick the 1 values to get the "odds"
cols <- names(data)[col_index] # this are the column names you want to apply over

dictionary <- match.words$.

data[, (cols) := lapply(.SD, function(value) {
  ifelse(value %in% dictionary, value, NA)
}), .SDcols = cols]

CodePudding user response：

Here is another function to be applied in Merijn's answer. The code is a copy&paste from his answer, with the anonymous function changed to is.na<-. I also do not create the vector dictionary and use match.words$. directly

library(data.table)
setDT(data) 

col_index <- 19:67 
col_index <- col_index[col_index %% 2 == 1] 
cols <- names(data)[col_index] 

data[, (cols) := lapply(.SD, function(x) {
  is.na(x) <- !x %in% match.words$.
  x
}), .SDcols = cols]

^{Created on 2022-02-23 by the reprex package (v2.0.1)}

Timings.

The timings will be run with base function system.time. I start by making a copy of data named data2, set both to class "data.table" and run both solutions.

library(data.table)

data2 <- data
setDT(data)
setDT(data2)

col_index <- 19:67 
col_index <- col_index[col_index %% 2 == 1] 
cols <- names(data)[col_index]

t1 <- system.time({
  data[, (cols) := lapply(.SD, function(value) {
    ifelse(value %in% match.words$., value, NA)
  }), .SDcols = cols]
})
t2 <- system.time({
  data2[, (cols) := lapply(.SD, function(x) {
    is.na(x) <- !x %in% match.words$.
    x
  }), .SDcols = cols]
})
rbind(t1, t2)
#>    user.self sys.self elapsed user.child sys.child
#> t1      0.01        0    0.02         NA        NA
#> t2      0.00        0    0.00         NA        NA

^{Created on 2022-02-23 by the reprex package (v2.0.1)}

The results are already better for is.na<-, let's run this with bigger data.
Start by recreating data from the original post and make a copy of it, like above. Then a for loop makes it bigger with 1.3M rows. The results now show a real difference between the two, is.na<- is an order of magnitude faster than ifelse.

for(i in 1:17){
  data <- rbind(data, data)
}
dim(data)
#> [1] 1310720     115

data2 <- data

setDT(data)
setDT(data2)

col_index <- 19:67 # define your first and last one to check
col_index <- col_index[col_index %% 2 == 1] # use Modulus and pick the 1 values to get the "odds"
cols <- names(data)[col_index] # this are the column names you want to apply over

t1 <- system.time({
  data[, (cols) := lapply(.SD, function(value) {
    ifelse(value %in% match.words$., value, NA)
  }), .SDcols = cols]
})
t2 <- system.time({
  data2[, (cols) := lapply(.SD, function(x) {
    is.na(x) <- !x %in% match.words$.
    x
  }), .SDcols = cols]
})
rbind(t1, t2)
#>    user.self sys.self elapsed user.child sys.child
#> t1      6.32     0.21    6.55         NA        NA
#> t2      0.59     0.11    0.70         NA        NA

^{Created on 2022-02-23 by the reprex package (v2.0.1)}