Take 20 subsets of data?

Time:04-09

I have a dataset and would like to take many subsets of it based on various columns, values, and conditional operators. I think the most desirable output is a list containing each of these subsetted dataframes as a separate element. I attempted this by building a dataframe that holds the subset conditions, writing a function, and then using apply to feed each row of that dataframe to the function, but that didn't work. There is probably a better method that uses an anonymous function or something similar, but I'm not sure how I would implement it. Below is example code that should produce 8 subsets of data.

Original dataset, where x1 and x2 are scores on items that won't be used for subsetting and RT and LS are the variables that will be subset on:

df <- data.frame(x1 = rnorm(100),
                 x2 = rnorm(100),
                 RT = abs(rnorm(100)),
                 LS = sample(1:10, 100, replace = TRUE))

Dataframe containing the conditions for subsetting. E.g., the first subset of data should be any observations with values greater than or equal to 0.5 in the RT column, the second subset should be any observations with values greater than or equal to 1 in the RT column, etc. There should be 8 subsets: 4 based on the RT variable (using >=) and 4 based on the LS variable (using <=).

subsetConditions <- data.frame(column = rep(c("RT", "LS"), each = 4),
                      operator = rep(c(">=", "<="), each = 4),
                      value = c(0.5, 1, 1.5, 2,
                                9, 8, 7, 6))

And this is the ugly function I wrote to attempt to do this:

subsetFun <- function(x){
  subset(df, eval(parse(text = paste(x))))
}  

subsets <- apply(subsetConditions, 1, subsetFun)

Thanks for any help!

CodePudding user response:

You were actually pretty close with your function; it needs just one adjustment. For each row, paste(x) leaves the 3 elements (column, operator, value) as 3 separate strings, so you need collapse = "" to join them into a single string such as "RT>=0.5". That single string can then be parsed and evaluated as one expression.

subsetFun <- function(x){
  subset(df, eval(parse(text = paste(x, collapse = ""))))
}  

subsets <- apply(subsetConditions, 1, subsetFun)
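
As a minimal, self-contained sketch of the same idea, the condition strings can be built once up front and then evaluated one at a time. The set.seed call, the stringsAsFactors = FALSE argument, the conditions vector, and naming each list element by its condition are assumptions added for illustration; they are not part of the original code.

```r
set.seed(42)  # assumption: fixed seed so the example is reproducible

df <- data.frame(x1 = rnorm(100),
                 x2 = rnorm(100),
                 RT = abs(rnorm(100)),
                 LS = sample(1:10, 100, replace = TRUE))

subsetConditions <- data.frame(column = rep(c("RT", "LS"), each = 4),
                               operator = rep(c(">=", "<="), each = 4),
                               value = c(0.5, 1, 1.5, 2,
                                         9, 8, 7, 6),
                               stringsAsFactors = FALSE)

# One condition string per row, e.g. "RT >= 0.5"
conditions <- with(subsetConditions, paste(column, operator, value))

# Evaluate each string inside df and keep the matching rows
subsets <- lapply(conditions, function(cond) {
  keep <- eval(parse(text = cond), envir = df)
  df[keep, ]
})

# Label each subset with the condition that produced it
names(subsets) <- conditions
```

With the list named this way, a single subset can be pulled out by its condition, for example subsets[["RT >= 0.5"]].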

Output

This returns a list of the 8 subsets. The exact row counts vary from run to run because df is generated randomly without a seed.

str(subsets)

List of 8
 $ :'data.frame':   67 obs. of  4 variables:
  ..$ x1: num [1:67] -1.208 0.606 -0.17 0.728 -0.424 ...
  ..$ x2: num [1:67] 0.4058 -0.3041 -0.3357 0.7904 -0.0264 ...
  ..$ RT: num [1:67] 1.972 0.883 0.598 0.633 1.517 ...
  ..$ LS: int [1:67] 8 9 2 10 8 5 3 4 7 2 ...
 $ :'data.frame':   35 obs. of  4 variables:
  ..$ x1: num [1:35] -1.2083 -0.4241 -0.0906 0.9851 -0.8236 ...
  ..$ x2: num [1:35] 0.4058 -0.0264 1.0054 0.0653 1.4647 ...
  ..$ RT: num [1:35] 1.97 1.52 1.05 1.63 1.47 ...
  ..$ LS: int [1:35] 8 8 5 4 7 3 1 6 8 6 ...
 $ :'data.frame':   16 obs. of  4 variables:
  ..$ x1: num [1:16] -1.208 -0.424 0.985 0.99 0.939 ...
  ..$ x2: num [1:16] 0.4058 -0.0264 0.0653 0.3486 -0.7562 ...
  ..$ RT: num [1:16] 1.97 1.52 1.63 1.85 1.8 ...
  ..$ LS: int [1:16] 8 8 4 6 10 2 6 6 3 9 ...
 $ :'data.frame':   7 obs. of  4 variables:
  ..$ x1: num [1:7] 0.963 0.423 -0.444 0.279 0.417 ...
  ..$ x2: num [1:7] 0.6612 0.0354 0.0555 0.1253 -0.3056 ...
  ..$ RT: num [1:7] 2.71 2.15 2.05 2.01 2.07 ...
  ..$ LS: int [1:7] 2 6 9 9 7 7 4
 $ :'data.frame':   91 obs. of  4 variables:
  ..$ x1: num [1:91] -0.952 -1.208 0.606 -0.17 -0.048 ...
  ..$ x2: num [1:91] -0.645 0.406 -0.304 -0.336 -0.897 ...
  ..$ RT: num [1:91] 0.471 1.972 0.883 0.598 0.224 ...
  ..$ LS: int [1:91] 6 8 9 2 1 8 4 5 3 4 ...
 $ :'data.frame':   75 obs. of  4 variables:
  ..$ x1: num [1:75] -0.952 -1.208 -0.17 -0.048 -0.424 ...
  ..$ x2: num [1:75] -0.6448 0.4058 -0.3357 -0.8968 -0.0264 ...
  ..$ RT: num [1:75] 0.471 1.972 0.598 0.224 1.517 ...
  ..$ LS: int [1:75] 6 8 2 1 8 4 5 3 4 1 ...
 $ :'data.frame':   65 obs. of  4 variables:
  ..$ x1: num [1:65] -0.9517 -0.1698 -0.048 0.2834 -0.0906 ...
  ..$ x2: num [1:65] -0.645 -0.336 -0.897 -2.072 1.005 ...
  ..$ RT: num [1:65] 0.471 0.598 0.224 0.486 1.053 ...
  ..$ LS: int [1:65] 6 2 1 4 5 3 4 1 7 4 ...
 $ :'data.frame':   58 obs. of  4 variables:
  ..$ x1: num [1:58] -0.9517 -0.1698 -0.048 0.2834 -0.0906 ...
  ..$ x2: num [1:58] -0.645 -0.336 -0.897 -2.072 1.005 ...
  ..$ RT: num [1:58] 0.471 0.598 0.224 0.486 1.053 ...
  ..$ LS: int [1:58] 6 2 1 4 5 3 4 1 4 2 ...

CodePudding user response:

Consider Map (a wrapper around mapply), which avoids eval(parse()) entirely. Operators such as ==, <=, and >= are ordinary functions of two arguments, so 4 <= 5 can also be written as `<=`(4, 5) or "<="(4, 5). You can therefore pass the column, operator, and value elementwise and use get to look up the operator function from its name as a string:

sub_data <- function(col, op, val) {
  df[get(op)(df[[col]], val),]
}

sub_dfs <- with(subsetConditions, Map(sub_data, column, operator, value))
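
The following is a self-contained sketch of the same approach under a few assumptions that are not in the original answer: a fixed seed, stringsAsFactors = FALSE, match.fun in place of get (it only ever returns a function), and list names rebuilt from the full condition rather than just the column name.

```r
set.seed(42)  # assumption: fixed seed so the example is reproducible

df <- data.frame(x1 = rnorm(100),
                 x2 = rnorm(100),
                 RT = abs(rnorm(100)),
                 LS = sample(1:10, 100, replace = TRUE))

subsetConditions <- data.frame(column = rep(c("RT", "LS"), each = 4),
                               operator = rep(c(">=", "<="), each = 4),
                               value = c(0.5, 1, 1.5, 2,
                                         9, 8, 7, 6),
                               stringsAsFactors = FALSE)

# An operator called as a function gives the same result as the infix form
stopifnot(identical(`>=`(3, 2), 3 >= 2))

sub_data <- function(col, op, val) {
  compare <- match.fun(op)          # look up the operator function by name
  df[compare(df[[col]], val), ]     # keep rows where the comparison is TRUE
}

sub_dfs <- with(subsetConditions, Map(sub_data, column, operator, value))

# Replace the default names (column names only) with the full condition
names(sub_dfs) <- with(subsetConditions, paste(column, operator, value))

# Number of rows in each subset
sapply(sub_dfs, nrow)
```

Because no code is parsed from strings here, only the operator name is looked up, which makes this version harder to break with an unexpected value in subsetConditions.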

Output

str(sub_dfs)
List of 8
 $ RT:'data.frame': 62 obs. of  4 variables:
  ..$ x1: num [1:62] -1.12 -0.745 -1.377 0.848 1.63 ...
  ..$ x2: num [1:62] -0.257 -2.385 0.805 -0.313 0.662 ...
  ..$ RT: num [1:62] 0.693 1.662 0.731 2.145 0.543 ...
  ..$ LS: int [1:62] 5 5 1 2 9 1 5 9 3 10 ...
 $ RT:'data.frame': 36 obs. of  4 variables:
  ..$ x1: num [1:36] -0.745 0.848 0.908 -0.761 0.74 ...
  ..$ x2: num [1:36] -2.3849 -0.3131 -2.4645 -0.0784 0.8512 ...
  ..$ RT: num [1:36] 1.66 2.15 1.74 1.65 1.13 ...
  ..$ LS: int [1:36] 5 2 1 5 9 10 2 7 1 3 ...
 $ RT:'data.frame': 14 obs. of  4 variables:
  ..$ x1: num [1:14] -0.745 0.848 0.908 -0.761 -1.063 ...
  ..$ x2: num [1:14] -2.3849 -0.3131 -2.4645 -0.0784 -2.9886 ...
  ..$ RT: num [1:14] 1.66 2.15 1.74 1.65 2.63 ...
  ..$ LS: int [1:14] 5 2 1 5 5 6 9 4 8 4 ...
 $ RT:'data.frame': 3 obs. of  4 variables:
  ..$ x1: num [1:3] 0.848 -1.063 0.197
  ..$ x2: num [1:3] -0.313 -2.989 0.709
  ..$ RT: num [1:3] 2.15 2.63 2.05
  ..$ LS: int [1:3] 2 5 6
 $ LS:'data.frame': 92 obs. of  4 variables:
  ..$ x1: num [1:92] -1.12 -0.745 -1.377 0.848 0.612 ...
  ..$ x2: num [1:92] -0.257 -2.385 0.805 -0.313 0.958 ...
  ..$ RT: num [1:92] 0.693 1.662 0.731 2.145 0.489 ...
  ..$ LS: int [1:92] 5 5 1 2 1 9 1 5 9 3 ...
 $ LS:'data.frame': 78 obs. of  4 variables:
  ..$ x1: num [1:78] -1.12 -0.745 -1.377 0.848 0.612 ...
  ..$ x2: num [1:78] -0.257 -2.385 0.805 -0.313 0.958 ...
  ..$ RT: num [1:78] 0.693 1.662 0.731 2.145 0.489 ...
  ..$ LS: int [1:78] 5 5 1 2 1 1 5 3 5 2 ...
 $ LS:'data.frame': 75 obs. of  4 variables:
  ..$ x1: num [1:75] -1.12 -0.745 -1.377 0.848 0.612 ...
  ..$ x2: num [1:75] -0.257 -2.385 0.805 -0.313 0.958 ...
  ..$ RT: num [1:75] 0.693 1.662 0.731 2.145 0.489 ...
  ..$ LS: int [1:75] 5 5 1 2 1 1 5 3 5 2 ...
 $ LS:'data.frame': 62 obs. of  4 variables:
  ..$ x1: num [1:62] -1.12 -0.745 -1.377 0.848 0.612 ...
  ..$ x2: num [1:62] -0.257 -2.385 0.805 -0.313 0.958 ...
  ..$ RT: num [1:62] 0.693 1.662 0.731 2.145 0.489 ...
  ..$ LS: int [1:62] 5 5 1 2 1 1 5 3 5 2 ...