Extract the first 4 letters of the first word and the first 3 letters of the second word and combine

I have a data frame full of species names. I need to create a species_code variable made of the first 4 characters of the genus followed by the first 3 characters of the second part of the species name, with no spaces.

Sometimes the second part of the species name is not known, and it is abbreviated as "sp." or "spp." The "spp." form breaks the 3-character rule (it has 4 characters including the period), so in that case the code should also end in "sp."

Here is a step-by-step example of how I usually go about it. I am sure there are more elegant solutions, and I was hoping someone could help. What I am wondering is:

(a) Is there an alternative to str_match() in the stringr package? I tried str_extract(), but it returns only the whole match, not the pieces captured within the parentheses, which are what I need (see step2 below; can this be made more concise?)

(b) Can step3 be handled in the regex itself (see the explanation in the second paragraph above)?

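library(dplyr)
library(stringr)

tibble(
  # species names
  species = c("CALLIERGON GIGANTEUM", "CEPHALOZIELLA SP.", "LICHEN SPP."),
  # what the species code should look like after the regex
  species_code = c("CALLGIG", "CEPHSP.", "LICHSP.")
) %>%
  mutate(
    # capture 4 characters of the first word and up to 3 characters
    # (plus an optional period) of the second word
    step1 = str_match(species, "(\\w{4})\\w*\\s(\\w{1,3}\\.?)\\w*"),
    step2 = paste0(step1[, 2], step1[, 3]),
    # the period is escaped so that it only matches a literal "."
    step3 = str_replace(step2, "SPP\\.", "SP.")
  ) -> almost_done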
 
almost_done
# A tibble: 3 × 5
#  species              species_code step1[,1]            [,2]  [,3]  step2    step3  
#  <chr>                <chr>        <chr>                <chr> <chr> <chr>    <chr>  
#1 CALLIERGON GIGANTEUM CALLGIG      CALLIERGON GIGANTEUM CALL  GIG   CALLGIG  CALLGIG
#2 CEPHALOZIELLA SP.    CEPHSP.      CEPHALOZIELLA SP.    CEPH  SP.   CEPHSP.  CEPHSP.
#3 LICHEN SPP.          LICHSP.      LICHEN SPP.          LICH  SPP.  LICHSPP. LICHSP.
 
# drop the intermediate columns step1 and step2 (positions 3 and 4)
almost_done %>%
  select(!(3:4)) -> done

done
# A tibble: 3 × 3
#  species              species_code step3  
#  <chr>                <chr>        <chr>  
#1 CALLIERGON GIGANTEUM CALLGIG      CALLGIG
#2 CEPHALOZIELLA SP.    CEPHSP.      CEPHSP.
#3 LICHEN SPP.          LICHSP.      LICHSP.

CodePudding user response：

Does this work?

library(dplyr)
library(stringr)
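# df is assumed to be the tibble from the question; the second part is read
# from species (after the space), not from the target column species_code
df %>% mutate(newcol = str_c(
  str_extract(species, '[A-Z]{4}'),
  str_extract(str_replace(species, 'SPP\\.$', 'SP.'), '(?<= )[A-Z.]{3}')))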
# A tibble: 3 × 3
  species              species_code newcol 
  <chr>                <chr>        <chr>  
1 CALLIERGON GIGANTEUM CALLGIG      CALLGIG
2 CEPHALOZIELLA SP.    CEPHSP.      CEPHSP.
3 LICHEN SPP.          LICHSP.      LICHSP.
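
Regarding (a) and (b) in the question, another option might be a single str_replace() call with backreferences, which avoids both the str_match() matrix and the separate "SPP." fix-up. This is only a sketch, assuming that the only names ending in a period are the "SP." / "SPP." abbreviations; the names species, pat and code are made up for illustration.

```r
library(stringr)

species <- c("CALLIERGON GIGANTEUM", "CEPHALOZIELLA SP.", "LICHEN SPP.")

# group 1: first 4 characters of the first word
# group 2: "SP" if the second word is "SP." or "SPP." (checked by the lookahead),
#          otherwise the first 3 word characters of the second word
# group 3: a trailing period, if there is one
pat <- "^(\\w{4})\\w*\\s+(SP(?=P?\\.)|\\w{3})\\w*(\\.?)$"

# keep only the three captured pieces, glued together
code <- str_replace(species, pat, "\\1\\2\\3")
print(code)
```

Note that str_replace() returns strings that do not match the pattern unchanged (for example a name consisting of a single word), so it may be worth checking for those separately.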

CodePudding user response：

If you want a pure regex solution, one (rather dense) approach in base R is:

pat <- '^(.{4}).* (?(?=SPP?\\.)(SP)P?(.)|(.{3}).*)'

transform(df, code = sub(pat, "\\1\\2\\3\\4", species, perl = TRUE))

               species species_code    code
1 CALLIERGON GIGANTEUM      CALLGIG CALLGIG
2    CEPHALOZIELLA SP.      CEPHSP. CEPHSP.
3          LICHEN SPP.      LICHSP. LICHSP.
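
Since the pattern above is hard to read in one piece, here is a sketch that builds the same string from commented parts and applies it to a plain character vector. It relies on the PCRE conditional syntax (?(condition)yes|no), which is why perl = TRUE is needed; the names pat_pieces, species and code are only illustrative.

```r
# The same pattern as above, split into commented pieces
pat_pieces <- paste0(
  "^(.{4})",        # group 1: first four characters of the genus
  ".* ",            # rest of the genus, up to the last space
  "(?(?=SPP?\\.)",  # condition: does the second word start with "SP." or "SPP."?
  "(SP)P?(.)",      #   yes: group 2 = "SP", skip an optional "P", group 3 = "."
  "|(.{3}).*)"      #   no:  group 4 = first three characters, the rest is dropped
)

species <- c("CALLIERGON GIGANTEUM", "CEPHALOZIELLA SP.", "LICHEN SPP.")

# groups that did not take part in the match are replaced by an empty string
code <- sub(pat_pieces, "\\1\\2\\3\\4", species, perl = TRUE)
print(code)
```

Only one branch of the conditional matches for any given name, so either groups 2 and 3 or group 4 contribute to the result, never both.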