Drop every duplicate value including the first instance of that value in R-CodePudding

I am trying to filter a grouped data frame where:

Every duplicate in one column ("data_val") including the first instance of the duplicate is dropped.
The row values in the columns corresponding to the duplicate columns are deleted as well (even if they are not duplicates).

It is important to note that I am grouping by more than one variable: "sources" and "db_source".

So, given:

example_data <- data.frame(sources = rep(c("S1", "S2", "S3"), each = 4),
                           data_val = rep(c(1, 1, 2, 3), 3),
                           db_source = rep(c("DB1", "DB2"), 6))

example_data <- example_data[order(example_data$sources, example_data$data_val), ]

sources data_val db_source
    S1        1       DB1
    S1        1       DB2
    S1        2       DB1
    S1        3       DB2
    S2        1       DB1
    S2        1       DB2
    S2        2       DB1
    S2        3       DB2
    S3        1       DB1
    S3        1       DB2
    S3        2       DB1
    S3        3       DB2

I want the following:

sources data_val db_source
    S1        2       DB1
    S1        3       DB2
    S2        2       DB1
    S2        3       DB2
    S3        2       DB1
    S3        3       DB2

I tried using functions like duplicated() and distinct() in my pipe, but they will return the following:

sources data_val db_source
    S1        1       DB1
    S1        2       DB1
    S1        3       DB2
    S2        1       DB1
    S2        2       DB1
    S2        3       DB2
    S3        1       DB1
    S3        2       DB1
    S3        3       DB2

I understand the listed functions return the above because they check the vectors sequentially, but if there is a way to drop the first instance that would be great.

Thank you to anyone who can help.

CodePudding user response：

dplyr

library(dplyr)
example_data %>%
  group_by(sources, db_source) %>%
  slice(n()) %>%
  ungroup()
# # A tibble: 4 x 3
#   sources data_val db_source
#   <chr>      <dbl> <chr>    
# 1 S1             2 DB1      
# 2 S1             3 DB2      
# 3 S2             2 DB1      
# 4 S2             3 DB2

base R

Either one works:

example_data[ave(rep(TRUE, nrow(example_data)), example_data[c("sources","db_source")], FUN = function(z) seq_along(z) == length(z)),]
#   sources data_val db_source
# 3      S1        2       DB1
# 4      S1        3       DB2
# 7      S2        2       DB1
# 8      S2        3       DB2

do.call(rbind, by(example_data, example_data[c("sources","db_source")], function(z) z[nrow(z),]))
#   sources data_val db_source
# 3      S1        2       DB1
# 7      S2        2       DB1
# 4      S1        3       DB2
# 8      S2        3       DB2

CodePudding user response：

Another optione would be using lag function:

library(dplyr)

example_data %>%
  group_by(sources) %>% 
  filter(data_val != lag(data_val)) %>% 
  ungroup()


  sources data_val db_source
  <chr>      <dbl> <chr>    
1 S1             2 DB1      
2 S1             3 DB2      
3 S2             2 DB1      
4 S2             3 DB2