I have a dataframe with duplicate IDs, but the rows are not identical. Is there a function that identifies the column (or columns) in which the rows of a duplicated ID differ?
My real application is a dataframe with hundreds of columns. I need some way to check whether changes were made in important columns or only in irrelevant ones, so first of all I need to identify the changed columns.
Example:
ID <- c(1,2,2,4,5,5,5,6,6,7)
Info1 <- c(10,20,20,40,50,50,50,65,60,70)
Info2 <- c('A','B','A','D', 'E','E','F', 'Z','A','B')
Info3 <- c(999,998,997,995,995,995,995,946,800,805)
df <- data.frame(ID, Info1, Info2, Info3)
ID Info1 Info2 Info3
1 1 10 A 999
2 2 20 B 998
3 2 20 A 997
4 4 40 D 995
5 5 50 E 995
6 5 50 E 995
7 5 50 F 995
8 6 65 Z 946
9 6 60 A 800
10 7 70 B 805
My goal is an additional column that lists the columns that changed within each ID, i.e. the desired output:
ID Info1 Info2 Info3 col_diff
1 1 10 A 999 <NA>
2 2 20 B 998 Info2; Info3
3 2 20 A 997 Info2; Info3
4 4 40 D 995 <NA>
5 5 50 E 995 Info2
6 5 50 E 995 Info2
7 5 50 F 995 Info2
8 6 65 Z 946 Info1; Info2; Info3
9 6 60 A 800 Info1; Info2; Info3
10 7 70 B 805 <NA>
I hope my problem is clear. Perhaps there is some function within dplyr that I do not know yet. Of course my solution with the additional column is not really elegant, so I am open to any ideas that may solve my problem.
Thanks a lot!
Answer:
library(dplyr)

# For one group's data, return the names of all columns that contain
# more than one unique value, collapsed into a single string.
func <- function(X) {
  paste(names(
    Filter(function(z) length(z) > 1,
           lapply(X, unique))
  ), collapse = ";")
}

df %>%
  group_by(ID) %>%
  mutate(col_diff = func(cur_data())) %>%
  ungroup()
# # A tibble: 10 x 5
# ID Info1 Info2 Info3 col_diff
# <dbl> <dbl> <chr> <dbl> <chr>
# 1 1 10 A 999 ""
# 2 2 20 B 998 "Info2;Info3"
# 3 2 20 A 997 "Info2;Info3"
# 4 4 40 D 995 ""
# 5 5 50 E 995 "Info2"
# 6 5 50 E 995 "Info2"
# 7 5 50 F 995 "Info2"
# 8 6 65 Z 946 "Info1;Info2;Info3"
# 9 6 60 A 800 "Info1;Info2;Info3"
# 10 7 70 B 805 ""
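As a side note, cur_data() was deprecated in dplyr 1.1.0 in favour of pick(); on a recent dplyr the same call can be sketched as:
df %>%
  group_by(ID) %>%
  mutate(col_diff = func(pick(everything()))) %>%
  ungroup()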
If you're planning on using col_diff in a programmatic way later, it might be better to keep it as a list-column, so that you can test membership with %in%, etc.
# Same idea, but return the changed column names as a character vector,
# replicated once per row of the group so that the result is a list-column.
func2 <- function(X) {
  replicate(nrow(X), names(
    Filter(function(z) length(z) > 1,
           lapply(X, unique))
  ), simplify = FALSE)
}
out <- df %>%
group_by(ID) %>%
mutate(col_diff = func2(cur_data())) %>%
ungroup()
out
# # A tibble: 10 x 5
# ID Info1 Info2 Info3 col_diff
# <dbl> <dbl> <chr> <dbl> <list>
# 1 1 10 A 999 <chr [0]>
# 2 2 20 B 998 <chr [2]>
# 3 2 20 A 997 <chr [2]>
# 4 4 40 D 995 <chr [0]>
# 5 5 50 E 995 <chr [1]>
# 6 5 50 E 995 <chr [1]>
# 7 5 50 F 995 <chr [1]>
# 8 6 65 Z 946 <chr [3]>
# 9 6 60 A 800 <chr [3]>
# 10 7 70 B 805 <chr [0]>
str(out$col_diff)
# List of 10
# $ : chr(0)
# $ : chr [1:2] "Info2" "Info3"
# $ : chr [1:2] "Info2" "Info3"
# $ : chr(0)
# $ : chr "Info2"
# $ : chr "Info2"
# $ : chr "Info2"
# $ : chr [1:3] "Info1" "Info2" "Info3"
# $ : chr [1:3] "Info1" "Info2" "Info3"
# $ : chr(0)
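With the list-column in place, membership tests are straightforward. For example, a small (hypothetical) follow-up using purrr::map_lgl keeps only the rows whose ID group shows a change in Info2:
library(purrr)

# keep rows whose group has Info2 among its changed columns
out %>%
  filter(map_lgl(col_diff, ~ "Info2" %in% .x))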
Answer:
We may use across() together with cur_column(): grouped by 'ID', loop across the 'Info' columns and, if the number of distinct elements in a column (n_distinct()) is greater than 1, return the column name (cur_column()), otherwise return a blank (""). Then reduce() the resulting columns to a single string with str_c() and strip any leading/trailing delimiters with trimws().
library(dplyr)
library(stringr)
library(purrr)
df %>%
  group_by(ID) %>%
  mutate(col_diff = across(starts_with('Info'),
                           ~ if(n_distinct(.x) > 1) cur_column() else "") %>%
           reduce(str_c, sep = "; ") %>%
           trimws(whitespace = ";\\s+")) %>%
  ungroup
Output:
# A tibble: 10 × 5
ID Info1 Info2 Info3 col_diff
<dbl> <dbl> <chr> <dbl> <chr>
1 1 10 A 999 ""
2 2 20 B 998 "Info2; Info3"
3 2 20 A 997 "Info2; Info3"
4 4 40 D 995 ""
5 5 50 E 995 "Info2"
6 5 50 E 995 "Info2"
7 5 50 F 995 "Info2"
8 6 65 Z 946 "Info1; Info2; Info3"
9 6 60 A 800 "Info1; Info2; Info3"
10 7 70 B 805 ""
Answer:
Another possible solution:
library(tidyverse)
ID <- c(1,2,2,4,5,5,5,6,6,7)
Info1 <- c(10,20,20,40,50,50,50,65,60,70)
Info2 <- c('A','B','A','D', 'E','E','F', 'Z','A','B')
Info3 <- c(999,998,997,995,995,995,995,946,800,805)
df <- data.frame(ID, Info1, Info2, Info3)
df %>%
  group_by(ID) %>%
  mutate(across(starts_with("Info"),
                ~ if_else(n_distinct(.x) != 1, cur_column(), NA_character_))) %>%
  ungroup %>%
  unite("col_diff", starts_with("Info"), na.rm = TRUE, sep = "; ") %>%
  mutate(col_diff = na_if(col_diff, "")) %>%
  inner_join(df, ., by = "ID") %>%
  distinct %>%
  inner_join(df, ., by = c("ID", "Info1", "Info2", "Info3"))
#> ID Info1 Info2 Info3 col_diff
#> 1 1 10 A 999 <NA>
#> 2 2 20 B 998 Info2; Info3
#> 3 2 20 A 997 Info2; Info3
#> 4 4 40 D 995 <NA>
#> 5 5 50 E 995 Info2
#> 6 5 50 E 995 Info2
#> 7 5 50 F 995 Info2
#> 8 6 65 Z 946 Info1; Info2; Info3
#> 9 6 60 A 800 Info1; Info2; Info3
#> 10 7 70 B 805 <NA>
Answer:
Since you mentioned you're open to other solutions that may be more 'elegant', here's a way to report the columns that are heterogeneous within one ID in a more 'tidy' fashion.
library(tidyverse)
# data
df <- structure(list(ID = c(1, 2, 2, 4, 5, 5, 5, 6, 6, 7), Info1 = c(10, 20, 20, 40, 50, 50, 50, 65, 60, 70), Info2 = c("A", "B", "A", "D", "E", "E", "F", "Z", "A", "B"), Info3 = c(999, 998, 997, 995, 995, 995, 995, 946, 800, 805)), class = "data.frame", row.names = c(NA, -10L))
# enumerate columns that are heterogeneous within one ID in tidy format
changes <- df %>%
group_by(ID) %>%
mutate(across(everything(), ~n_distinct(.x) > 1)) %>%
pivot_longer(-ID, names_to = "col", values_to = "changed") %>%
filter(changed) %>%
select(-changed) %>%
distinct()
# inspect result
changes
#> # A tibble: 6 x 2
#> # Groups: ID [3]
#> ID col
#> <dbl> <chr>
#> 1 2 Info2
#> 2 2 Info3
#> 3 5 Info2
#> 4 6 Info1
#> 5 6 Info2
#> 6 6 Info3
This makes it easy to then report which IDs have changes in the 'important' columns.
# indicate important columns
important_cols <- c("Info2", "Info3")
# report IDs with changes in important columns
changes %>%
filter(col %in% important_cols)
#> # A tibble: 5 x 2
#> # Groups: ID [3]
#> ID col
#> <dbl> <chr>
#> 1 2 Info2
#> 2 2 Info3
#> 3 5 Info2
#> 4 6 Info2
#> 5 6 Info3
Or even subset the original data to just the 'important' changes so you can manually inspect or further analyze if needed.
# or subset original data for 'important' changes
df %>%
select(ID,
any_of(intersect(unique(changes$col),
important_cols))) %>%
filter(ID %in% changes$ID)
#> ID Info2 Info3
#> 1 2 B 998
#> 2 2 A 997
#> 3 5 E 995
#> 4 5 E 995
#> 5 5 F 995
#> 6 6 Z 946
#> 7 6 A 800
Created on 2022-02-09 by the reprex package (v2.0.1)
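If you still want the single col_diff column from the question on top of this, one possible sketch (untested beyond this example) is to collapse the changes table per ID and join it back onto the original data:
# collapse the changed column names per ID and attach them to the original rows
df %>%
  left_join(
    changes %>%
      group_by(ID) %>%
      summarise(col_diff = paste(col, collapse = "; ")),
    by = "ID"
  )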