Validate multiple columns using one set of rules

Time:10-25

I am just getting acquainted with the validate package. Unfortunately, I ran into a problem right at the start and cannot find a solution. I would like to create one validation rule that I can later apply to multiple variables. Here is an example. I have the following tibble:

library(tidyverse)
library(validate)

df = tibble(
  id = rep(1:10, each=20),
  name = rep(paste0("v", 1:20), 10),
  value = rnorm(length(name))
) %>% pivot_wider()

Output

# A tibble: 10 x 21
      id     v1     v2      v3      v4     v5     v6      v7       v8      v9    v10
   <int>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>    <dbl>   <dbl>  <dbl>
 1     1  1.20   0.182 -1.53    2.73   -1.60  -0.976 -0.767  -2.28    -0.257   0.736
 2     2  0.484  0.913 -0.873  -0.801   0.172  1.11  -1.71    0.0125   0.0450  0.374
 3     3 -0.604 -0.405  0.482   0.998  -0.634  0.212  0.717   0.598   -0.876   0.139
 4     4 -0.324 -1.83   0.0195 -1.70    0.506 -0.139  3.21   -0.00169 -0.200  -1.03 
 5     5  0.268  1.40   0.349   0.667   1.76   0.926 -1.09   -0.487    2.03    0.203
 6     6  0.646  0.516  0.849  -0.619  -2.18   0.126 -0.0956 -0.471    0.0342  0.530
 7     7 -1.03  -1.27  -0.0716 -2.13   -0.340  1.20   0.746  -0.366   -2.82   -0.431
 8     8  0.415  0.313  0.591  -0.0552  0.132  1.86  -0.427   0.390   -0.506  -0.470
 9     9  0.309  1.13  -0.472   0.760  -0.549 -0.954 -0.219  -0.653    0.335  -0.870
10    10  1.06   1.30   1.12    0.646   0.279 -1.45  -0.891  -0.278    0.637   0.236
# ... with 10 more variables: v11 <dbl>, v12 <dbl>, v13 <dbl>, v14 <dbl>, v15 <dbl>,
#   v16 <dbl>, v17 <dbl>, v18 <dbl>, v19 <dbl>, v20 <dbl>

I can validate one variable using the following rule:

df %>% 
  confront(
    validator(
      num.val = is.numeric(v1),
      big.val = !(v1>10),
      low.val = !(v1< -10),
      NA.val = !is.na(v1)
    )
  ) %>% summary()
#      name items passes fails nNA error warning     expression
# 1 num.val     1      1     0   0 FALSE   FALSE is.numeric(v1)
# 2 big.val    10     10     0   0 FALSE   FALSE       v1 <= 10
# 3 low.val    10     10     0   0 FALSE   FALSE      v1 >= -10
# 4  NA.val    10     10     0   0 FALSE   FALSE     !is.na(v1)

However, I would like to apply this rule to multiple columns using some simple notation. Unfortunately, the code below does not work.

df %>% 
  confront(
    validator(
      num.val = is.numeric(v1:v20),
      big.val = !(v1:v20>10),
      low.val = !(v1:v20< -10),
      NA.val = !is.na(v1:v20)
    )
  ) %>% summary()
#      name items passes fails nNA error warning         expression
# 1 num.val     1      1     0   0 FALSE    TRUE is.numeric(v1:v20)
# 2 big.val     1      1     0   0 FALSE    TRUE       v1:v20 <= 10
# 3 low.val     1      1     0   0 FALSE    TRUE      v1:v20 >= -10
# 4  NA.val     1      1     0   0 FALSE    TRUE     !is.na(v1:v20)
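
A likely explanation for the warnings and the single item per rule: inside validator(), v1:v20 does not seem to be read as a column range. As far as I can tell it is evaluated as base R's : operator applied to the two columns v1 and v20, which uses only the first element of each operand and returns a sequence. A minimal base-R sketch with made-up values (not the data above):

```r
# Two hypothetical columns standing in for v1 and v20 (illustrative values)
v1  <- c(1.2, 0.5, -0.6)
v20 <- c(1.7, -0.3, 0.9)

# `:` looks only at the first element of each operand (R warns about this),
# so this is the sequence from 1.2 up to at most 1.7 in steps of 1
rng <- suppressWarnings(v1:v20)

length(rng)      # a single value, not one per row or per column
is.numeric(rng)  # TRUE, but this says nothing about the columns in between
```

So the rules above would be checking a short numeric sequence rather than the twenty columns, which is consistent with items = 1 and warning = TRUE in the summary.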

I understand that I can always convert my data to long format.

df %>% 
  pivot_longer(v1:v20) %>% 
  confront(
    validator(
      num.val = is.numeric(value),
      big.val = !(value>10),
      low.val = !(value< -10),
      NA.val = !is.na(value)
    )
  ) %>% summary()
#      name items passes fails nNA error warning        expression
# 1 num.val     1      1     0   0 FALSE   FALSE is.numeric(value)
# 2 big.val   200    200     0   0 FALSE   FALSE       value <= 10
# 3 low.val   200    200     0   0 FALSE   FALSE      value >= -10
# 4  NA.val   200    200     0   0 FALSE   FALSE     !is.na(value)

However, in this case, I will not be able to determine in which variable the validation failed.

Any suggestions on how one can easily apply one validation rule to many selected variables?

CodePudding user response:

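This approach comes from ?validate::syntax: the . stands for the whole data set. Note that it gives a different result for num.val, presumably because is.numeric() returns FALSE when applied to a data frame as a whole. I looked through the Data Validation Cookbook but could not find a simple way to select multiple columns.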

df %>% 
  select(-id) %>%
  confront(
    validator(
      num.val = is.numeric(.),
      big.val = !(.>10),
      low.val = !(.< -10),
      NA.val = !is.na(.)
    )
  ) %>% summary() 

     name items passes fails nNA error warning    expression
1 num.val     1      0     1   0 FALSE   FALSE is.numeric(.)
2 big.val   200    200     0   0 FALSE   FALSE       . <= 10
3 low.val   200    200     0   0 FALSE   FALSE      . >= -10
4  NA.val   200    200     0   0 FALSE   FALSE     !is.na(.)
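
For what it's worth, the validate syntax help (?validate::syntax) also describes variable groups: a group is declared with := and var_group(), and each rule that mentions the group is expanded into one rule per member. A minimal sketch, assuming a small made-up data frame with three columns instead of the twenty above; the group name G, the seed and the data are illustrative only:

```r
library(validate)

set.seed(1)  # only to make the sketch reproducible
# Hypothetical stand-in for the wide tibble in the question
df <- data.frame(
  id = 1:10,
  v1 = rnorm(10),
  v2 = rnorm(10),
  v3 = rnorm(10)
)

rules <- validator(
  G := var_group(v1, v2, v3),  # declare the group once
  is.numeric(G),               # each rule below is expanded per member of G
  G <= 10,
  G >= -10,
  !is.na(G)
)

res <- summary(confront(df, rules))
res  # one row per rule and variable
```

Since every variable gets its own row in the summary, a failing column can be identified from the expression column. As far as I know, the member names still have to be listed explicitly inside var_group().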

CodePudding user response:

If we change the OP's pivot_longer code to split the long data by name with group_split, each variable gets its own summary:

library(purrr)
library(dplyr)
library(tidyr)
out <- df %>% 
  pivot_longer(v1:v20) %>% 
  group_split(name) %>% 
  map(~ .x %>% confront(
    validator(
      num.val = is.numeric(value),
      big.val = !(value>10),
      low.val = !(value< -10),
      NA.val = !is.na(value)
    )
  ) %>% summary()) 

-output

> out[1:4]
[[1]]
     name items passes fails nNA error warning        expression
1 num.val     1      1     0   0 FALSE   FALSE is.numeric(value)
2 big.val    10     10     0   0 FALSE   FALSE       value <= 10
3 low.val    10     10     0   0 FALSE   FALSE      value >= -10
4  NA.val    10     10     0   0 FALSE   FALSE     !is.na(value)

[[2]]
     name items passes fails nNA error warning        expression
1 num.val     1      1     0   0 FALSE   FALSE is.numeric(value)
2 big.val    10     10     0   0 FALSE   FALSE       value <= 10
3 low.val    10     10     0   0 FALSE   FALSE      value >= -10
4  NA.val    10     10     0   0 FALSE   FALSE     !is.na(value)

[[3]]
     name items passes fails nNA error warning        expression
1 num.val     1      1     0   0 FALSE   FALSE is.numeric(value)
2 big.val    10     10     0   0 FALSE   FALSE       value <= 10
3 low.val    10     10     0   0 FALSE   FALSE      value >= -10
4  NA.val    10     10     0   0 FALSE   FALSE     !is.na(value)

[[4]]
     name items passes fails nNA error warning        expression
1 num.val     1      1     0   0 FALSE   FALSE is.numeric(value)
2 big.val    10     10     0   0 FALSE   FALSE       value <= 10
3 low.val    10     10     0   0 FALSE   FALSE      value >= -10
4  NA.val    10     10     0   0 FALSE   FALSE     !is.na(value)
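
One caveat with the list above: group_split() returns an unnamed list ordered by the sorted values of name (so v10 comes right after v1), which makes it easy to lose track of which summary belongs to which variable. A possible variation, sketched under the assumption that a single data frame with a column identifying the variable is more convenient. It uses base split(), which keeps the names, and bind_rows(.id = ...). The seed and the column name variable are my additions:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(validate)

set.seed(1)  # only to make the sketch reproducible
df <- tibble(
  id = rep(1:10, each = 20),
  name = rep(paste0("v", 1:20), 10),
  value = rnorm(200)
) %>% pivot_wider(names_from = name, values_from = value)

# Define the rule set once so it can be reused for every variable
rules <- validator(
  num.val = is.numeric(value),
  big.val = value <= 10,
  low.val = value >= -10,
  NA.val  = !is.na(value)
)

long <- df %>% pivot_longer(v1:v20)

# split() returns a list named by variable; bind_rows() turns those
# names into a "variable" column in the combined summary
out <- split(long, long$name) %>%
  map(~ summary(confront(.x, rules))) %>%
  bind_rows(.id = "variable")

# Show only the rule/variable combinations that need attention
filter(out, fails > 0 | nNA > 0)
```

With the random normal data used here the final filter should come back empty; on real data it would list the offending variables together with the rule that failed.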