Writing a function that takes a vector as input, throws away unwanted values, de-duplicates, and returns indexes

I'm trying to write a function that takes in a vector and subsets it according to several steps:

1. Throws away any unwanted values.
2. Removes duplicates.
3. Returns the indexes of the original vector after accounting for steps (1) and (2).

For example, provided with the following input vector:

vec_animals <- c("dog", "dog", "dog", "dog", "cat", "dolphin", "dolphin")

and

throw_away_val <- "cat"

I want my function get_indexes(x = vec_animals, y = throw_away_val) to return:

# [1] 1 6   # `1` is the index of the 1st unique ("dog") in `vec_animals`, `6` is the index of the 2nd unique ("dolphin")

Another example

vec_years <- c(2003, 2003, 2003, 2007, 2007, 2011, 2011, 2011)
throw_away_val <- 2003

Return:

# [1] 4 6 # `4` is the position of 1st unique (`2007`) after throwing away unwanted val; `6` is the position of 2nd unique (`2011`).

My initial attempt

The following function returns indexes but doesn't account for duplicates

get_index <- function(x, throw_away) {
  which(x != throw_away)
}

It returns indexes into the original vec_animals, duplicates included:

get_index(vec_animals, "cat")
#> [1] 1 2 3 4 6 7

If we use this output to subset vec_animals we get:

vec_animals[get_index(vec_animals, "cat")]
#> [1] "dog"     "dog"     "dog"     "dog"     "dolphin" "dolphin"

One could suggest post-processing this output, for example:

vec_animals[get_index(vec_animals, "cat")] |> unique()
#> [1] "dog"     "dolphin"

But that yields the values, not their positions. I need get_index() to return the correct indexes directly (in this case 1 and 6).

EDIT

A related way to get the indexes of the first occurrence of each value is

library(bit64)

vec_num <- as.integer64(c(4, 2, 2, 3, 3, 3, 3, 100, 100))
unipos(vec_num)
#> [1] 1 2 4 8

Or more generally

which(!duplicated(vec_num))
#> [1] 1 2 4 8

Such solutions would have been great if I had not also needed to throw away unwanted values.

CodePudding user response：

Try:

get_index <- function(x, throw_away) {
  which(!duplicated(x) & x != throw_away)
}

> get_index(vec_animals, "cat")
[1] 1 6
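
Why this works: both `!duplicated(x)` and `x != throw_away` are logical vectors as long as `x`, so `which()` reports positions in the original vector. Filtering first and de-duplicating afterwards would report positions in the shortened vector instead. The sketch below shows the difference. `get_index_multi` is a hypothetical name, and accepting several unwanted values via `%in%` is an assumption that goes beyond what the question asks for.

```r
vec_animals <- c("dog", "dog", "dog", "dog", "cat", "dolphin", "dolphin")
vec_years <- c(2003, 2003, 2003, 2007, 2007, 2011, 2011, 2011)

# Pitfall: these positions refer to the filtered vector, not the original
shifted <- which(!duplicated(vec_animals[vec_animals != "cat"]))
shifted
#> [1] 1 5

# Hypothetical variant: throw_away may hold more than one value
get_index_multi <- function(x, throw_away) {
  # Both masks have the length of x, so which() indexes the original vector
  which(!duplicated(x) & !(x %in% throw_away))
}

get_index_multi(vec_animals, "cat")
#> [1] 1 6
get_index_multi(vec_years, 2003)
#> [1] 4 6
get_index_multi(vec_animals, c("cat", "dog"))
#> [1] 6
```

One difference to keep in mind: with `!=`, an `NA` in `x` produces `NA` in the mask and `which()` drops it, whereas with `%in%` an `NA` counts as an ordinary value unless `NA` is itself listed in `throw_away`.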

CodePudding user response：

Here is a simple self-written function that provides the needed information.

vec_animals <- c("dog", "dog", "dog", "dog", "cat", "dolphin", "dolphin")

get_indexes <- function(x, throw_away) {
  # Unique values of x, minus the unwanted one
  elements <- unique(x)
  elements <- elements[elements != throw_away]
  # Collect the first position of each remaining value;
  # seq_along() avoids iterating over 1:0 when nothing remains
  index2return <- integer(0)
  for (j in seq_along(elements)) {
    index2return <- c(index2return, min(which(x %in% elements[j])))
  }
  return(index2return)
}

get_indexes(x = vec_animals, throw_away = "cat")
[1] 1 6
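
The loop can probably be collapsed, since `match()` returns the position of the first match for each value it is given. A minimal sketch under that observation follows. `get_indexes_match` is a name made up for this illustration and is not part of the answer above.

```r
vec_animals <- c("dog", "dog", "dog", "dog", "cat", "dolphin", "dolphin")

# Hypothetical compact variant of get_indexes()
get_indexes_match <- function(x, throw_away) {
  # Unique values in order of first appearance, minus the unwanted ones
  elements <- unique(x)
  elements <- elements[!(elements %in% throw_away)]
  # match() gives the position of the first occurrence of each value in x
  match(elements, x)
}

get_indexes_match(vec_animals, "cat")
#> [1] 1 6
```

If every value is thrown away, `elements` is empty and `match()` returns `integer(0)`, so no special case is needed.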

CodePudding user response：

My approach:

vec_animals <- c("dog", "dog", "dog", "dog", "cat", "dolphin", "dolphin")
throw_away_val <- "cat"

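my_function <- function(x, y) {
  # Pair each value with its position in the original vector
  my_df <- data.frame("Origin" = x,
                      "Position" = seq_along(x),
                      stringsAsFactors = FALSE)
  # Drop rows holding unwanted values, if there are any
  my_var <- which(my_df$Origin %in% y)
  if (length(my_var)) {
    my_df <- my_df[-my_var, ]
  }
  # Keep only the first occurrence of each remaining value
  my_df <- my_df[!duplicated(my_df$Origin), ]
  return(my_df)
}

my_df <- my_function(vec_animals, throw_away_val)
# The Position column holds the indexes into the original vector
my_df$Position
#> [1] 1 6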