Create a dataframe with all observations unique for one specific column of a dataframe in R

Time:07-19

I have a dataframe that I would like to reduce in size by keeping only unique observations. However, I want uniqueness to be judged on one column only, while the rest of each kept row is preserved. Because other columns also contain repeated values, I cannot simply pass the entire dataframe to the unique function. How can I do this and still get the entire dataframe back?

For example, with the following dataframe, I would like to only reduce the dataframe by unique observations of variable a (column 1):

a b c d e
1 2 3 4 5
1 2 3 4 6
3 4 5 6 8
4 5 2 3 6

Therefore, I only remove row 2, because "1" is repeated. The other rows/columns repeat values, but these observations are maintained, because I only assess the uniqueness of column 1 (a).

Desired outcome:

a b c d e
1 2 3 4 5
3 4 5 6 8
4 5 2 3 6

How can I process this and then retrieve the entire dataframe? Is there a configuration for the unique function to do this, or do I need an alternative?

CodePudding user response:

base R

dat[!duplicated(dat$a),]
#   a b c d e
# 1 1 2 3 4 5
# 3 3 4 5 6 8
# 4 4 5 2 3 6
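
The answer uses dat without defining it. Here is a minimal sketch that rebuilds the question's example and shows what duplicated() returns, assuming the columns are integers (as the tibble output further down suggests). The names dup, first_rows and last_rows are only illustrative.

```r
# Rebuild the example data from the question; the name `dat` matches the answer.
dat <- data.frame(
  a = c(1L, 1L, 3L, 4L),
  b = c(2L, 2L, 4L, 5L),
  c = c(3L, 3L, 5L, 2L),
  d = c(4L, 4L, 6L, 3L),
  e = c(5L, 6L, 8L, 6L)
)

# duplicated() flags every repeat of a value after its first occurrence,
# so only the second row (the repeated 1) is TRUE here.
dup <- duplicated(dat$a)

# Keep the first row for each value of a (original rows 1, 3 and 4).
first_rows <- dat[!dup, ]

# Keep the last row for each value of a instead (original rows 2, 3 and 4).
last_rows <- dat[!duplicated(dat$a, fromLast = TRUE), ]
```

Negating the logical vector and using it as a row index is what keeps all columns while judging uniqueness on a alone.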

dplyr

dplyr::distinct(dat, a, .keep_all = TRUE)
#   a b c d e
# 1 1 2 3 4 5
# 2 3 4 5 6 8
# 3 4 5 2 3 6

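Another option: group by a and, within each group of duplicated rows, keep one particular row. The examples below keep the row with the largest e, so for a = 1 the result has e = 6 rather than the first row's e = 5.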

library(dplyr)
dat %>%
  group_by(a) %>%
  slice(which.max(e)) %>%
  ungroup()
# # A tibble: 3 x 5
#       a     b     c     d     e
#   <int> <int> <int> <int> <int>
# 1     1     2     3     4     6
# 2     3     4     5     6     8
# 3     4     5     2     3     6

library(data.table)
as.data.table(dat)[, .SD[which.max(e),], by = .(a) ]
#        a     b     c     d     e
#    <int> <int> <int> <int> <int>
# 1:     1     2     3     4     6
# 2:     3     4     5     6     8
# 3:     4     5     2     3     6
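
If neither package is available, the same "largest e per value of a" selection can be approximated in base R by sorting first and then dropping duplicates. This is only a sketch on the question's example data; sorted and max_e_rows are illustrative names, and the result comes back ordered by a.

```r
# Example data from the question, assuming integer columns.
dat <- data.frame(
  a = c(1L, 1L, 3L, 4L),
  b = c(2L, 2L, 4L, 5L),
  c = c(3L, 3L, 5L, 2L),
  d = c(4L, 4L, 6L, 3L),
  e = c(5L, 6L, 8L, 6L)
)

# Sort by a, and within each a by decreasing e, so the row with the
# largest e comes first in its group.
sorted <- dat[order(dat$a, -dat$e), ]

# Keeping the first row per value of a now keeps the max-e row.
max_e_rows <- sorted[!duplicated(sorted$a), ]
```

Because order() is stable, ties in e are resolved in favour of the row that appeared first in dat, which matches which.max picking the first maximum.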

As for unique, its data frame method does have an incomparables argument, but it is not yet implemented:

unique(dat, incomparables = c("b", "c", "d", "e"))
# Error: argument 'incomparables != FALSE' is not used (yet)
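
If you still want to go through unique, one possible workaround is to apply it to the column alone and then look up the row of each value's first occurrence with match(). A minimal sketch on the question's example data follows; vals, rows and result are illustrative names.

```r
# Example data from the question, assuming integer columns.
dat <- data.frame(
  a = c(1L, 1L, 3L, 4L),
  b = c(2L, 2L, 4L, 5L),
  c = c(3L, 3L, 5L, 2L),
  d = c(4L, 4L, 6L, 3L),
  e = c(5L, 6L, 8L, 6L)
)

# unique() on the column returns only the distinct values, not the rows.
vals <- unique(dat$a)

# match() gives the position of the first occurrence of each value.
rows <- match(vals, dat$a)

# Indexing by those positions recovers the full rows.
result <- dat[rows, ]
```

On this data the result is the same as dat[!duplicated(dat$a), ], so the duplicated() form is usually the shorter choice.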