I'm trying to make a column sub_species based on a condition from the Species column.
There are three unique values for Species. If Species starts with setosa, then I'd like to repeat setosa1 and setosa2 25 times each inside the new column sub_species. The same logic goes for the other two.
Note that each Species value has exactly 50 rows. Hence, the length matches when 25 repetitions are used.
library(dplyr)
iris %>%
  mutate(
    sub_species = case_when(
      startsWith(as.character(Species), "setosa") ~ rep(c("setosa1", "setosa2"), length(1:25)),
      startsWith(as.character(Species), "versicolor") ~ rep(c("versicolor1", "versicolor2"), length(1:25)),
      startsWith(as.character(Species), "virginica") ~ rep(c("virginica1", "virginica2"), length(1:25))
    )
  )
Error: must be length 150 or one, not 50.
I tried with just setosa separately and it worked. However, it doesn't work when I want to do it as a whole.
Answer:
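You were close: instead of length(1:25) (which may not work as intended), use length.out. Each right-hand side in case_when must have length 1 or the same length as the data (150 rows here), but rep(x, length(1:25)) repeats the pair 25 times and returns only 50 values. With length.out = n(), the vector is recycled to exactly the number of rows. It has the added safeguard (in general, not with iris) of producing the right number of sub-species labels even when you have an odd number of rows.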
iris %>%
  mutate(
    sub_species = case_when(
      startsWith(as.character(Species), "setosa") ~ rep(c("setosa1", "setosa2"), length.out = n()),
      startsWith(as.character(Species), "versicolor") ~ rep(c("versicolor1", "versicolor2"), length.out = n()),
      startsWith(as.character(Species), "virginica") ~ rep(c("virginica1", "virginica2"), length.out = n())
    )
  ) %>%
  head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species sub_species
# 1 5.1 3.5 1.4 0.2 setosa setosa1
# 2 4.9 3.0 1.4 0.2 setosa setosa2
# 3 4.7 3.2 1.3 0.2 setosa setosa1
# 4 4.6 3.1 1.5 0.2 setosa setosa2
# 5 5.0 3.6 1.4 0.2 setosa setosa1
# 6 5.4 3.9 1.7 0.4 setosa setosa2
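If the numbering should restart within each species regardless of group size, one possible variant is to group first and build the label with paste0. This is only a sketch, not part of the original answer: it assumes the sub-species label is always the Species name followed by 1 or 2, and the name iris_sub is introduced here purely for illustration.

```r
library(dplyr)

# Assumption: the label is always "<Species>1" or "<Species>2",
# alternating within each Species group.
iris_sub <- iris %>%
  group_by(Species) %>%
  # n() is now the size of the current group, so the 1/2 pattern
  # restarts for every species, even with an odd number of rows
  mutate(sub_species = paste0(Species, rep(1:2, length.out = n()))) %>%
  ungroup()

# Count how many rows received each label
table(iris_sub$sub_species)
```

Because the grouping handles the per-species logic, no case_when branch is needed for each species name.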
base R
paste0 by itself might be okay, but as in the dplyr example above, this code will be a little safer if there is not an even number of rows.
iris$sub_species <- paste0(
  iris$Species,
  ave(seq_len(nrow(iris)), iris$Species,
      FUN = function(z) rep(1:2, length.out = length(z)))
)
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species sub_species
# 1 5.1 3.5 1.4 0.2 setosa setosa1
# 2 4.9 3.0 1.4 0.2 setosa setosa2
# 3 4.7 3.2 1.3 0.2 setosa setosa1
# 4 4.6 3.1 1.5 0.2 setosa setosa2
# 5 5.0 3.6 1.4 0.2 setosa setosa1
# 6 5.4 3.9 1.7 0.4 setosa setosa2
The seq_len(nrow(iris)) is there because ave requires the return value to be the same class as the first argument; since we want numbers, I gave it numbers. We don't care what they are, but they must be the same length as the data. (I could have used one of the numeric columns, but I wanted my intentions here to be clear.)
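To see how the ave call behaves when a group has an odd number of rows, here is a small sketch on made-up data. The vectors g, idx, and labels are hypothetical names used only for this illustration.

```r
# Toy grouping vector: group "a" has 3 rows (odd), group "b" has 2
g <- c("a", "a", "a", "b", "b")

# The first argument only supplies a numeric vector of the right length;
# FUN replaces each group's values with an alternating 1, 2, 1, ...
idx <- ave(seq_along(g), g,
           FUN = function(z) rep(1:2, length.out = length(z)))

labels <- paste0(g, idx)
print(labels)
# [1] "a1" "a2" "a1" "b1" "b2"
```

The counter restarts at 1 for group "b", and the odd-sized group "a" simply ends on a 1.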