I have a data frame full of species names. I need to create a species_code variable that holds the first 4 characters of the genus followed by the first 3 characters of the second part of the species name, with no spaces.
Sometimes the second species name is not known and then it is abbreviated as "sp." or "spp." If it is "spp.", it breaks the rule of 3 characters (it has 4 incl. the period). In this case the ending should be "sp." as well.
Below is a step-by-step example of how I usually go about it. I am sure there are far more elegant solutions, and I was hoping someone could help. What I am wondering is:
(a) Is there an alternative to str_match() within the stringr package? I tried str_extract(), but it doesn't extract the matches within the parentheses, i.e. the pieces that I need (see step2 below; can this be made more concise?).
(b) Can step3 be handled in the regex itself (see the explanation in the 2nd paragraph above)?
library(dplyr)
library(stringr)
tibble(
  # species names
  species = c("CALLIERGON GIGANTEUM", "CEPHALOZIELLA SP.", "LICHEN SPP."),
  # what the species code should look like after the regex
  species_code = c("CALLGIG", "CEPHSP.", "LICHSP.")
) %>%
  mutate(
    # full match plus two groups: first 4 chars of the genus, first 1-3 chars of the second word (plus optional period)
    step1 = str_match(species, "(\\w{4})\\w*\\s(\\w{1,3}\\.?)\\w*"),
    step2 = paste0(step1[, 2], step1[, 3]),
    # collapse a trailing "SPP." to "SP."
    step3 = str_replace(step2, "SPP\\.$", "SP.")
  ) -> almost_done
almost_done
# A tibble: 3 × 5
#  species              species_code step1[,1]            [,2]  [,3]  step2    step3
#  <chr>                <chr>        <chr>                <chr> <chr> <chr>    <chr>
#1 CALLIERGON GIGANTEUM CALLGIG      CALLIERGON GIGANTEUM CALL  GIG   CALLGIG  CALLGIG
#2 CEPHALOZIELLA SP.    CEPHSP.      CEPHALOZIELLA SP.    CEPH  SP.   CEPHSP.  CEPHSP.
#3 LICHEN SPP.          LICHSP.      LICHEN SPP.          LICH  SPP.  LICHSPP. LICHSP.
almost_done %>%
  select(!(3:4)) -> done
done
# A tibble: 3 × 3
#  species              species_code step3
#  <chr>                <chr>        <chr>
#1 CALLIERGON GIGANTEUM CALLGIG      CALLGIG
#2 CEPHALOZIELLA SP.    CEPHSP.      CEPHSP.
#3 LICHEN SPP.          LICHSP.      LICHSP.
CodePudding user response:
Does this work:
library(dplyr)
library(stringr)
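# df is assumed to be the tibble from the question (columns species and species_code)
df %>%
  mutate(
    newcol = str_c(
      # first 4 letters of the genus
      str_extract(species, '[A-Z]{4}'),
      # up to 3 letters (plus optional period) after the space, with "SPP." collapsed to "SP." first
      str_extract(str_replace(species, 'SPP\\.$', 'SP.'), '(?<= )[A-Z]{1,3}\\.?')
    )
  )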
# A tibble: 3 × 3
  species              species_code newcol
  <chr>                <chr>        <chr>
1 CALLIERGON GIGANTEUM CALLGIG      CALLGIG
2 CEPHALOZIELLA SP.    CEPHSP.      CEPHSP.
3 LICHEN SPP.          LICHSP.      LICHSP.
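On question (a), a single str_replace() call with backreferences might stand in for str_match() plus paste0(), and it would fold the "SPP." case from (b) into the pattern as well. This is only a sketch: the pattern is an illustration rather than something from the question, and it assumes every name is exactly two upper-case words with a genus of at least 4 characters.

```r
library(stringr)

species <- c("CALLIERGON GIGANTEUM", "CEPHALOZIELLA SP.", "LICHEN SPP.")

# group 1: first 4 characters of the genus
# group 2: "SP" when it is followed by an optional "P" and a period,
#          otherwise the first 3 characters of the second word
# group 3: a trailing period, if there is one
pattern <- "^(\\w{4})\\w*\\s+(SP(?=P?\\.)|\\w{3})\\w*(\\.?)$"

codes <- str_replace(species, pattern, "\\1\\2\\3")
codes
# [1] "CALLGIG" "CEPHSP." "LICHSP."
```

All three groups take part in every match (group 3 may match an empty string), so the replacement does not depend on how unmatched groups are filled in.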
CodePudding user response:
If you want a regex-only solution, a fairly dense approach in base R would be:
pat <- '^(.{4}).* (?(?=SPP?\\.)(SP)P?(.)|(.{3}).*)'
transform(df, code = sub(pat, "\\1\\2\\3\\4", species, perl = TRUE))
               species species_code    code
1 CALLIERGON GIGANTEUM      CALLGIG CALLGIG
2    CEPHALOZIELLA SP.      CEPHSP. CEPHSP.
3          LICHEN SPP.      LICHSP. LICHSP.
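The pattern relies on a PCRE conditional, (?(?=SPP?\\.)yes|no): if the second word starts with "SP." or "SPP.", groups 2 and 3 capture "SP" and the period while the optional second "P" is skipped; otherwise group 4 captures the first 3 characters. Groups that did not take part in the match come out empty in the replacement. The same idea can probably be written with a plain alternation, which may be easier to read. A sketch, assuming two-word upper-case names (pat_alt is just an illustrative name):

```r
species <- c("CALLIERGON GIGANTEUM", "CEPHALOZIELLA SP.", "LICHEN SPP.")

# ^(.{4})       group 1: first 4 characters of the genus
# \\S* and " "  rest of the genus, then the separating space
# (SP)P?(\\.)   groups 2 and 3: "SP" and the period, skipping the optional second "P"
# (.{3}).*      group 4: otherwise the first 3 characters of the second word
pat_alt <- "^(.{4})\\S* (?:(SP)P?(\\.)|(.{3}).*)$"

codes <- sub(pat_alt, "\\1\\2\\3\\4", species, perl = TRUE)
codes
# [1] "CALLGIG" "CEPHSP." "LICHSP."
```

The first branch is tried first and only succeeds when a period follows "SP" or "SPP", so a second word that merely starts with "SP" falls through to the 3-character branch.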