How to Create Repeating Values for All Unique Group Values in a Column in R

Time:03-04

I'm trying to make a column sub_species based on a condition from the Species column.

There are three unique values for Species. If Species starts with setosa, then I'd like to repeat setosa1 and setosa2 25 times respectively inside the new column sub_species. The same logic goes for the other two.

Note that each Species value has exactly 50 rows. Hence, the length matches when 25 repetitions are used.

library(dplyr)

iris %>% 
  mutate(
    sub_species = case_when(
      startsWith(as.character(Species), "setosa") ~ rep(c("setosa1", "setosa2"), length(1:25)),
      startsWith(as.character(Species), "versicolor") ~ rep(c("versicolor1", "versicolor2"), length(1:25)),
      startsWith(as.character(Species), "virginica") ~ rep(c("virginica1", "virginica2"), length(1:25))
    )
  )

Error: must be length 150 or one, not 50.

I tried with just setosa separately and it worked. However, it doesn't work when I want to do it as a whole.

CodePudding user response:

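You were close. length(1:25) is just 25, so rep() repeats each pair 25 times and returns 50 values, while case_when() needs every right-hand side to be length 1 or the number of rows (150 here). Instead of length(1:25), use length.out = n(). It has the added safeguard (in general, not with iris) that when you have an odd number of rows, you still get exactly the right number of sub-species labels.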
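The length mismatch can be checked in base R without dplyr. This is only a sketch: pair, short and full are illustrative names, and 150 stands in for what n() returns on ungrouped iris.

```r
pair <- c("setosa1", "setosa2")

# length(1:25) is 25, so this is rep(pair, times = 25): 2 * 25 = 50 values
short <- rep(pair, length(1:25))
length(short)  # 50

# length.out fixes the total length directly, whatever the number of rows is
full <- rep(pair, length.out = 150)
length(full)   # 150
```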

iris %>% 
  mutate(
    sub_species = case_when(
      startsWith(as.character(Species), "setosa") ~ rep(c("setosa1", "setosa2"), length.out = n()),
      startsWith(as.character(Species), "versicolor") ~ rep(c("versicolor1", "versicolor2"), length.out = n()),
      startsWith(as.character(Species), "virginica") ~ rep(c("virginica1", "virginica2"), length.out = n())
    )
  ) %>%
  head()
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species sub_species
# 1          5.1         3.5          1.4         0.2  setosa     setosa1
# 2          4.9         3.0          1.4         0.2  setosa     setosa2
# 3          4.7         3.2          1.3         0.2  setosa     setosa1
# 4          4.6         3.1          1.5         0.2  setosa     setosa2
# 5          5.0         3.6          1.4         0.2  setosa     setosa1
# 6          5.4         3.9          1.7         0.4  setosa     setosa2
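One caveat, offered as a hedged variant rather than a correction: without grouping, n() is the total row count, so the 1/2 alternation follows the absolute row position. That works for iris because every species starts on an odd row (1, 51, 101). If the numbering should restart within each group regardless of where the group begins, grouping first is one option. The toy data frame below is made up for illustration; only group_by(), mutate(), n() and ungroup() from dplyr are used.

```r
library(dplyr)

# Hypothetical data: group "b" starts on an even row and has an odd size
df <- data.frame(Species = c("a", "a", "a", "b", "b", "b", "b", "b"))

out <- df %>%
  group_by(Species) %>%
  # inside a grouped mutate(), n() is the size of the current group
  mutate(sub_species = paste0(Species, rep(1:2, length.out = n()))) %>%
  ungroup()

out$sub_species
# "a1" "a2" "a1" "b1" "b2" "b1" "b2" "b1"
```

The same pipeline applied to iris should give the same labels as the case_when() version, without listing each species by hand.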

base R

paste0 by itself (relying on recycling) might be okay, but as in the dplyr example above, this code is a little safer when the number of rows is not even.

iris$sub_species <- paste0(
  iris$Species,
  ave(seq_len(nrow(iris)), iris$Species,
      FUN = function(z) rep(1:2, length.out = length(z)))
)
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species sub_species
# 1          5.1         3.5          1.4         0.2  setosa     setosa1
# 2          4.9         3.0          1.4         0.2  setosa     setosa2
# 3          4.7         3.2          1.3         0.2  setosa     setosa1
# 4          4.6         3.1          1.5         0.2  setosa     setosa2
# 5          5.0         3.6          1.4         0.2  setosa     setosa1
# 6          5.4         3.9          1.7         0.4  setosa     setosa2

The seq_len(nrow(iris)) is there because ave requires the return value to be the same class as its first argument; since we want numbers, I gave it numbers. We don't care what they are, but there must be one per row. (I could have used one of the numeric columns, but I wanted my intentions to be clear.)
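To illustrate the odd-group-size case, here is a minimal sketch of the same ave() pattern on a made-up data frame (df, g, idx and sub are hypothetical names, not part of iris):

```r
# Hypothetical data: group "a" has 3 rows (odd), group "b" has 2
df <- data.frame(g = c("a", "a", "a", "b", "b"))

# ave() applies FUN to each group's slice of the first argument and puts
# the results back in the original row order
idx <- ave(seq_len(nrow(df)), df$g,
           FUN = function(z) rep(1:2, length.out = length(z)))

df$sub <- paste0(df$g, idx)
df$sub
# "a1" "a2" "a1" "b1" "b2"
```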
