I have a dataframe with duplicate IDs, but the rows are not identical. Is there a function that identifies the column (or columns) in which the rows of a duplicated ID differ?
My real application is a dataframe with hundreds of columns. I need some way to check whether changes were made in important columns or only in irrelevant ones, so first of all I need to identify the changed columns.
Example:
ID <- c(1,2,2,4,5,5,5,6,6,7)
Info1 <- c(10,20,20,40,50,50,50,65,60,70)
Info2 <- c('A','B','A','D', 'E','E','F', 'Z','A','B')
Info3 <- c(999,998,997,995,995,995,995,946,800,805)
df <- data.frame(ID, Info1, Info2, Info3)
ID Info1 Info2 Info3
1 1 10 A 999
2 2 20 B 998
3 2 20 A 997
4 4 40 D 995
5 5 50 E 995
6 5 50 E 995
7 5 50 F 995
8 6 65 Z 946
9 6 60 A 800
10 7 70 B 805
My goal is an additional column that lists the columns that changed within each ID, i.e. the desired output:
ID Info1 Info2 Info3 col_diff
1 1 10 A 999 <NA>
2 2 20 B 998 Info2; Info3
3 2 20 A 997 Info2; Info3
4 4 40 D 995 <NA>
5 5 50 E 995 Info2
6 5 50 E 995 Info2
7 5 50 F 995 Info2
8 6 65 Z 946 Info1; Info2; Info3
9 6 60 A 800 Info1; Info2; Info3
10 7 70 B 805 <NA>
I hope my problem is clear. Perhaps there is some function within dplyr that I do not know yet. Of course my solution with the additional column is not really elegant, so I am open to any ideas that may solve my problem.
Thanks a lot!
Answer:
library(dplyr)

# For one group's data, return the names of all columns that contain
# more than one unique value, collapsed into a single string.
func <- function(X) {
  paste(names(
    Filter(function(z) length(z) > 1,
           lapply(X, unique))
  ), collapse = ";")
}

df %>%
  group_by(ID) %>%
  mutate(col_diff = func(cur_data())) %>%
  ungroup()
# # A tibble: 10 x 5
# ID Info1 Info2 Info3 col_diff
# <dbl> <dbl> <chr> <dbl> <chr>
# 1 1 10 A 999 ""
# 2 2 20 B 998 "Info2;Info3"
# 3 2 20 A 997 "Info2;Info3"
# 4 4 40 D 995 ""
# 5 5 50 E 995 "Info2"
# 6 5 50 E 995 "Info2"
# 7 5 50 F 995 "Info2"
# 8 6 65 Z 946 "Info1;Info2;Info3"
# 9 6 60 A 800 "Info1;Info2;Info3"
# 10 7 70 B 805 ""
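As a side note, cur_data() was deprecated in dplyr 1.1.0 in favour of pick(); on a recent dplyr the same call can be sketched as:
df %>%
  group_by(ID) %>%
  mutate(col_diff = func(pick(everything()))) %>%
  ungroup()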
If you're planning on using col_diff in a programmatic way later, it might be better to keep it as a list-column, so that you can test membership with %in%, etc.
# Same idea, but return the changed column names as a character vector,
# replicated once per row of the group so that the result is a list-column.
func2 <- function(X) {
  replicate(nrow(X), names(
    Filter(function(z) length(z) > 1,
           lapply(X, unique))
  ), simplify = FALSE)
}
out <- df %>%
group_by(ID) %>%
mutate(col_diff = func2(cur_data())) %>%
ungroup()
out
# # A tibble: 10 x 5
# ID Info1 Info2 Info3 col_diff
# <dbl> <dbl> <chr> <dbl> <list>
# 1 1 10 A 999 <chr [0]>
# 2 2 20 B 998 <chr [2]>
# 3 2 20 A 997 <chr [2]>
# 4 4 40 D 995 <chr [0]>
# 5 5 50 E 995 <chr [1]>
# 6 5 50 E 995 <chr [1]>
# 7 5 50 F 995 <chr [1]>
# 8 6 65 Z 946 <chr [3]>
# 9 6 60 A 800 <chr [3]>
# 10 7 70 B 805 <chr [0]>
str(out$col_diff)
# List of 10
# $ : chr(0)
# $ : chr [1:2] "Info2" "Info3"
# $ : chr [1:2] "Info2" "Info3"
# $ : chr(0)
# $ : chr "Info2"
# $ : chr "Info2"
# $ : chr "Info2"
# $ : chr [1:3] "Info1" "Info2" "Info3"
# $ : chr [1:3] "Info1" "Info2" "Info3"
# $ : chr(0)
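With the list-column in place, membership tests are straightforward. For example, a small (hypothetical) follow-up using purrr::map_lgl keeps only the rows whose ID group shows a change in Info2:
library(purrr)

# keep rows whose group has Info2 among its changed columns
out %>%
  filter(map_lgl(col_diff, ~ "Info2" %in% .x))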
Answer:
We may use across() together with cur_column(): grouped by 'ID', loop across the 'Info' columns and, if the number of distinct elements in a column (n_distinct()) is greater than 1, return the column name (cur_column()), otherwise return a blank (""). Then reduce() the resulting columns to a single string with str_c() and strip any leading/trailing delimiters with trimws().
library(dplyr)
library(stringr)
library(purrr)
df %>%
  group_by(ID) %>%
  mutate(col_diff = across(starts_with('Info'),
                           ~ if(n_distinct(.x) > 1) cur_column() else "") %>%
           reduce(str_c, sep = "; ") %>%
           trimws(whitespace = ";\\s+")) %>%
  ungroup
Output:
# A tibble: 10 × 5
ID Info1 Info2 Info3 col_diff
<dbl> <dbl> <chr> <dbl> <chr>
1 1 10 A 999 ""
2 2 20 B 998 "Info2; Info3"
3 2 20 A 997 "Info2; Info3"
4 4 40 D 995 ""
5 5 50 E 995 "Info2"
6 5 50 E 995 "Info2"
7 5 50 F 995 "Info2"
8 6 65 Z 946 "Info1; Info2; Info3"
9 6 60 A 800 "Info1; Info2; Info3"
10 7 70 B 805 ""
Answer:
Another possible solution:
library(tidyverse)
ID <- c(1,2,2,4,5,5,5,6,6,7)
Info1 <- c(10,20,20,40,50,50,50,65,60,70)
Info2 <- c('A','B','A','D', 'E','E','F', 'Z','A','B')
Info3 <- c(999,998,997,995,995,995,995,946,800,805)
df <- data.frame(ID, Info1, Info2, Info3)
df %>%
  group_by(ID) %>%
  mutate(across(starts_with("Info"),
                ~ if_else(n_distinct(.x) != 1, cur_column(), NA_character_))) %>%
  ungroup %>%
  unite("col_diff", starts_with("Info"), na.rm = TRUE, sep = "; ") %>%
  mutate(col_diff = na_if(col_diff, "")) %>%
  inner_join(df, ., by = "ID") %>%
  distinct %>%
  inner_join(df, ., by = c("ID", "Info1", "Info2", "Info3"))
#> ID Info1 Info2 Info3 col_diff
#> 1 1 10 A 999 <NA>
#> 2 2 20 B 998 Info2; Info3
#> 3 2 20 A 997 Info2; Info3
#> 4 4 40 D 995 <NA>
#> 5 5 50 E 995 Info2
#> 6 5 50 E 995 Info2
#> 7 5 50 F 995 Info2
#> 8 6 65 Z 946 Info1; Info2; Info3
#> 9 6 60 A 800 Info1; Info2; Info3
#> 10 7 70 B 805 <NA>
Answer:
Since you mentioned you're open to other solutions that may be more 'elegant', here's a way to report the columns that are heterogeneous within one ID in a more 'tidy' fashion.
library(tidyverse)
# data
df <- structure(list(ID = c(1, 2, 2, 4, 5, 5, 5, 6, 6, 7), Info1 = c(10, 20, 20, 40, 50, 50, 50, 65, 60, 70), Info2 = c("A", "B", "A", "D", "E", "E", "F", "Z", "A", "B"), Info3 = c(999, 998, 997, 995, 995, 995, 995, 946, 800, 805)), class = "data.frame", row.names = c(NA, -10L))
# enumerate columns that are heterogeneous within one ID in tidy format
changes <- df %>%
group_by(ID) %>%
mutate(across(everything(), ~n_distinct(.x) > 1)) %>%
pivot_longer(-ID, names_to = "col", values_to = "changed") %>%
filter(changed) %>%
select(-changed) %>%
distinct()
# inspect result
changes
#> # A tibble: 6 x 2
#> # Groups: ID [3]
#> ID col
#> <dbl> <chr>
#> 1 2 Info2
#> 2 2 Info3
#> 3 5 Info2
#> 4 6 Info1
#> 5 6 Info2
#> 6 6 Info3
This makes it easy to then report which IDs have changes in the 'important' columns.
# indicate important columns
important_cols <- c("Info2", "Info3")
# report IDs with changes in important columns
changes %>%
filter(col %in% important_cols)
#> # A tibble: 5 x 2
#> # Groups: ID [3]
#> ID col
#> <dbl> <chr>
#> 1 2 Info2
#> 2 2 Info3
#> 3 5 Info2
#> 4 6 Info2
#> 5 6 Info3
Or even subset the original data to just the 'important' changes so you can manually inspect or further analyze if needed.
# or subset original data for 'important' changes
df %>%
select(ID,
any_of(intersect(unique(changes$col),
important_cols))) %>%
filter(ID %in% changes$ID)
#> ID Info2 Info3
#> 1 2 B 998
#> 2 2 A 997
#> 3 5 E 995
#> 4 5 E 995
#> 5 5 F 995
#> 6 6 Z 946
#> 7 6 A 800
Created on 2022-02-09 by the reprex package (v2.0.1)
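If you still want the single col_diff column from the question on top of this, one possible sketch (untested beyond this example) is to collapse the changes table per ID and join it back onto the original data:
# collapse the changed column names per ID and attach them to the original rows
df %>%
  left_join(
    changes %>%
      group_by(ID) %>%
      summarise(col_diff = paste(col, collapse = "; ")),
    by = "ID"
  )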