Drop every duplicate value including the first instance of that value in R


I am trying to filter a grouped data frame where:

  • Every duplicated value in one column ("data_val"), including the first instance of that value, is dropped.
  • The values in the other columns of those rows are dropped as well (even if they themselves are not duplicated).

It is important to note that I am grouping by more than one variable: "sources" and "db_source".

So, given:

example_data <- data.frame(sources = rep(c("S1", "S2", "S3"), each = 4),
                           data_val = rep(c(1, 1, 2, 3), 3),
                           db_source = rep(c("DB1", "DB2"), 6))

example_data <- example_data[order(example_data$sources, example_data$data_val), ]

sources data_val db_source
    S1        1       DB1
    S1        1       DB2
    S1        2       DB1
    S1        3       DB2
    S2        1       DB1
    S2        1       DB2
    S2        2       DB1
    S2        3       DB2
    S3        1       DB1
    S3        1       DB2
    S3        2       DB1
    S3        3       DB2

I want the following:

sources data_val db_source
    S1        2       DB1
    S1        3       DB2
    S2        2       DB1
    S2        3       DB2
    S3        2       DB1
    S3        3       DB2

I tried using functions like duplicated() and distinct() in my pipe, but they return the following:

sources data_val db_source
    S1        1       DB1
    S1        2       DB1
    S1        3       DB2
    S2        1       DB1
    S2        2       DB1
    S2        3       DB2
    S3        1       DB1
    S3        2       DB1
    S3        3       DB2

I understand the listed functions return the above because they check the vector sequentially and only flag the second and later occurrences, but if there is a way to also drop the first instance, that would be great.
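For example, duplicated() leaves the first occurrence unflagged, so filtering on !duplicated(data_val) still keeps the first 1 in each group:

duplicated(c(1, 1, 2, 3))
# [1] FALSE  TRUE FALSE FALSE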

Thank you to anyone who can help.

CodePudding user response:

dplyr

library(dplyr)
example_data %>%
  group_by(sources, db_source) %>%
  slice(n()) %>%
  ungroup()
# # A tibble: 6 x 3
#   sources data_val db_source
#   <chr>      <dbl> <chr>    
# 1 S1             2 DB1      
# 2 S1             3 DB2      
# 3 S2             2 DB1      
# 4 S2             3 DB2      
# 5 S3             2 DB1      
# 6 S3             3 DB2      
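A variant that states the rule directly (keep only data_val values that are not duplicated within their sources group) should give the same result here:

example_data %>%
  group_by(sources) %>%
  # drop every row whose data_val occurs more than once in its group
  filter(!data_val %in% data_val[duplicated(data_val)]) %>%
  ungroup()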

base R

Either one works:

example_data[ave(rep(TRUE, nrow(example_data)),
                 example_data[c("sources", "db_source")],
                 FUN = function(z) seq_along(z) == length(z)), ]
#    sources data_val db_source
# 3       S1        2       DB1
# 4       S1        3       DB2
# 7       S2        2       DB1
# 8       S2        3       DB2
# 11      S3        2       DB1
# 12      S3        3       DB2

do.call(rbind, by(example_data, example_data[c("sources", "db_source")],
                  function(z) z[nrow(z), ]))
#    sources data_val db_source
# 3       S1        2       DB1
# 7       S2        2       DB1
# 11      S3        2       DB1
# 4       S1        3       DB2
# 8       S2        3       DB2
# 12      S3        3       DB2
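If you would rather have base R drop every copy of a duplicated data_val within each sources group, instead of keeping the last row per (sources, db_source) pair, something along these lines should also work:

# flag values that occur more than once within their sources group
dup_flag <- ave(example_data$data_val, example_data$sources,
                FUN = function(z) duplicated(z) | duplicated(z, fromLast = TRUE))
example_data[dup_flag == 0, ]
#    sources data_val db_source
# 3       S1        2       DB1
# 4       S1        3       DB2
# 7       S2        2       DB1
# 8       S2        3       DB2
# 11      S3        2       DB1
# 12      S3        3       DB2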

CodePudding user response:

Another option would be to use the lag() function:

library(dplyr)

example_data %>%
  group_by(sources) %>% 
  filter(data_val != lag(data_val)) %>% 
  ungroup()


  sources data_val db_source
  <chr>      <dbl> <chr>    
1 S1             2 DB1      
2 S1             3 DB2      
3 S2             2 DB1      
4 S2             3 DB2      
5 S3             2 DB1      
6 S3             3 DB2      
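Note that this relies on the values being sorted so equal data_val values sit next to each other, and the NA that lag() produces on the first row of each group drops that row too, which happens to be what is wanted here because the first value is a duplicate. A sketch of a variant that also checks the following value, so a non-duplicated first row would be kept:

example_data %>%
  group_by(sources) %>%
  # drop a row when its value matches either the previous or the next value
  filter((is.na(lag(data_val))  | data_val != lag(data_val)) &
         (is.na(lead(data_val)) | data_val != lead(data_val))) %>%
  ungroup()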