Home > Software engineering >  In R, how can I validate whether values are valid according to an external table?
In R, how can I validate whether values are valid according to an external table?

Time:06-10

I have a reference table of acceptable column values, represented as good_df below. I want to use it to find out which values in the previously unseen test_df are not among the list of acceptable ones.

library(tidyverse)
good_df <- tribble(
  ~column, ~value,
  "col1", "a",
  "col1", "b",
  "col1", "c",
  "col1", "d",
  "col1", "e",
  "col2", "A",
  "col2", "B",
  "col2", "C"
)

set.seed(1)
test_df <- tibble(
  col1 = sample(letters[1:3], 4, T),
  col2 = c(sample(LETTERS[1:3], 3, T), "D"),  # The D is invalid.
  col3 = rnorm(4)
)
test_df
#> # A tibble: 4 × 3
#>   col1  col2    col3
#>   <chr> <chr>  <dbl>
#> 1 a     A      0.330
#> 2 c     C     -0.820
#> 3 a     C      0.487
#> 4 b     D      0.738

Created on 2022-06-09 by the reprex package (v2.0.1)

My thought is to use pivot_longer to convert test_df to a format that matches good_df, and then use setdiff to see what remains after the valid rows are removed, which must be the invalid ones. Something along the lines of this below, producing expected_output.

long_test_df <- pivot_longer(test_df, "MAGIC HAPPENS HERE") 

long_test_df %>% 
  select(column, value) %>% 
  setdiff(good_df)

expected_output <- tribble(
  ~column, ~value,
  "col2", "D",
)

I can't seem to get pivot_longer to work as I'm intending it. Maybe it's not designed to? I also thought of using the validate package, but I don't see anything in the documentation about explicitly listing allowed or forbidden values in creating the rules, unless I missed something.

Also of course, in practice I have many more acceptable column/value pairs and many more columns in my data sets to validate.

I am open to alternatives to my approach and to alternative forms of storing the set of allowed column values. How may I achieve this? (Bonus points for pretty pipes and fewer named variables!)

Edit - Solution:

A modified version of Jonathan's answer gives the fully general solution. (Also, I still prefer setdiff to anti_join here, although that's just a personal preference.)

test_df %>% 
  pivot_longer(cols = any_of(good_df$column), names_to = "column", values_to = "value") %>% 
  select(column, value) %>% 
  anti_join(good_df)

CodePudding user response:

Here is the code for pivot_longer, the missing values can be obtained with anti_join:

library(tidyverse)

test_df %>% 
  pivot_longer(c(col1, col2), names_to = "column", values_to = "value") %>% 
  anti_join(good_df)

CodePudding user response:

You might additionally consider this approach

cn = unique(good_df$column)
melt(test_df[,..cn], measure=cn,variable = "column")[!good_df, on=.(column, value)]

Output:

   column value
1:   col2     D

where test_df and good_df are of class data.table. That is:

library(data.table)
setDT(good_df); setDT(test_df)
  • Related