I have a reference table of acceptable column values, represented as good_df
below. I want to use it to find out which values in the previously unseen test_df
are not among the list of acceptable ones.
library(tidyverse)
good_df <- tribble(
~column, ~value,
"col1", "a",
"col1", "b",
"col1", "c",
"col1", "d",
"col1", "e",
"col2", "A",
"col2", "B",
"col2", "C"
)
set.seed(1)
test_df <- tibble(
col1 = sample(letters[1:3], 4, T),
col2 = c(sample(LETTERS[1:3], 3, T), "D"), # The D is invalid.
col3 = rnorm(4)
)
test_df
#> # A tibble: 4 × 3
#> col1 col2 col3
#> <chr> <chr> <dbl>
#> 1 a A 0.330
#> 2 c C -0.820
#> 3 a C 0.487
#> 4 b D 0.738
Created on 2022-06-09 by the reprex package (v2.0.1)
My thought is to use pivot_longer
to convert test_df
to a format that matches good_df
, and then use setdiff
to see what remains after the valid rows are removed, which must be the invalid ones. Something along the lines of this below, producing expected_output
.
long_test_df <- pivot_longer(test_df, "MAGIC HAPPENS HERE")
long_test_df %>%
select(column, value) %>%
setdiff(good_df)
expected_output <- tribble(
~column, ~value,
"col2", "D",
)
I can't seem to get pivot_longer
to work as I'm intending it. Maybe it's not designed to? I also thought of using the validate package, but I don't see anything in the documentation about explicitly listing allowed or forbidden values in creating the rules, unless I missed something.
Also of course, in practice I have many more acceptable column/value pairs and many more columns in my data sets to validate.
I am open to alternatives to my approach and to alternative forms of storing the set of allowed column values. How may I achieve this? (Bonus points for pretty pipes and fewer named variables!)
Edit - Solution:
A modified version of Jonathan's answer gives the fully general solution. (Also, I still prefer setdiff
to anti_join
here, although that's just a personal preference.)
test_df %>%
pivot_longer(cols = any_of(good_df$column), names_to = "column", values_to = "value") %>%
select(column, value) %>%
anti_join(good_df)
CodePudding user response:
Here is the code for pivot_longer
, the missing values can be obtained with anti_join
:
library(tidyverse)
test_df %>%
pivot_longer(c(col1, col2), names_to = "column", values_to = "value") %>%
anti_join(good_df)
CodePudding user response:
You might additionally consider this approach
cn = unique(good_df$column)
melt(test_df[,..cn], measure=cn,variable = "column")[!good_df, on=.(column, value)]
Output:
column value
1: col2 D
where test_df
and good_df
are of class data.table
. That is:
library(data.table)
setDT(good_df); setDT(test_df)