I am trying to filter a grouped data frame where:
- Every duplicate in one column ("data_val") including the first instance of the duplicate is dropped.
- The row values in the columns corresponding to the duplicate columns are deleted as well (even if they are not duplicates).
It is important to note that I am grouping by more than one variable: "sources" and "db_source".
So, given:
example_data <- data.frame(sources = rep(c("S1", "S2", "S3"), each = 4),
data_val = rep(c(1, 1, 2, 3), 3),
db_source = rep(c("DB1", "DB2"), 6))
example_data <- example_data[order(example_data$sources, example_data$data_val), ]
sources data_val db_source
S1 1 DB1
S1 1 DB2
S1 2 DB1
S1 3 DB2
S2 1 DB1
S2 1 DB2
S2 2 DB1
S2 3 DB2
S3 1 DB1
S3 1 DB2
S3 2 DB1
S3 3 DB2
I want the following:
sources data_val db_source
S1 2 DB1
S1 3 DB2
S2 2 DB1
S2 3 DB2
S3 2 DB1
S3 3 DB2
I tried using functions like duplicated()
and distinct()
in my pipe, but they will return the following:
sources data_val db_source
S1 1 DB1
S1 2 DB1
S1 3 DB2
S2 1 DB1
S2 2 DB1
S2 3 DB2
S3 1 DB1
S3 2 DB1
S3 3 DB2
I understand the listed functions return the above because they check the vectors sequentially, but if there is a way to drop the first instance that would be great.
Thank you to anyone who can help.
CodePudding user response:
dplyr
library(dplyr)
example_data %>%
group_by(sources, db_source) %>%
slice(n()) %>%
ungroup()
# # A tibble: 4 x 3
# sources data_val db_source
# <chr> <dbl> <chr>
# 1 S1 2 DB1
# 2 S1 3 DB2
# 3 S2 2 DB1
# 4 S2 3 DB2
base R
Either one works:
example_data[ave(rep(TRUE, nrow(example_data)), example_data[c("sources","db_source")], FUN = function(z) seq_along(z) == length(z)),]
# sources data_val db_source
# 3 S1 2 DB1
# 4 S1 3 DB2
# 7 S2 2 DB1
# 8 S2 3 DB2
do.call(rbind, by(example_data, example_data[c("sources","db_source")], function(z) z[nrow(z),]))
# sources data_val db_source
# 3 S1 2 DB1
# 7 S2 2 DB1
# 4 S1 3 DB2
# 8 S2 3 DB2
CodePudding user response:
Another optione would be using lag
function:
library(dplyr)
example_data %>%
group_by(sources) %>%
filter(data_val != lag(data_val)) %>%
ungroup()
sources data_val db_source
<chr> <dbl> <chr>
1 S1 2 DB1
2 S1 3 DB2
3 S2 2 DB1
4 S2 3 DB2