Extracting localities with stringr

I am trying to extract some phrases (localities) from a set of strings; several examples are below. The pattern looks for the text after the first ref. entry and should run up to the first period, while ignoring the periods in km., ca. and U.S. (the last one so that U.S.A. is handled).

However, the ca inside the pattern makes it skip the a. in Colombia. in the first example. If I test with Colombie, it works as expected. All the other examples work; only the first one does not.

library(dplyr)
library(stringr)

places <- tibble(
  text = c(
    "versity Studies No. 19; ref. 1259] Villavicencio, Río Orinoco system, Colombia. Holotype: FMNH 56646 [ex CM 5463]. Paratypes: (6) ",
    "versity Studies No. 19; ref. 1259] Villavicencio, Río Orinoco system, Colombie. Holotype: FMNH 56646 [ex CM 5463]. Paratypes: (6) ",
    "ions No. 457 (art. 17); ref. 2247] Roadside pool, 3 kilometers south of Progreso, Yucatán, southeastern Mexico. Holotype: UMMZ 102144. Paratypes: UMMZ 102145 (3). •Syn",
    "for 2001); ref. 27036] La Sapayera stream, near the bridge km. 2 vía Guadalupe-Florencia, 1°59'27\"N, 75°45'08\"W, afluent of La Viciosa stream, río Suaza system, upper Magdalena, Cachimba trail, Municipality of Guadalupe, Department of Huila, Colombia, elevation 952 meters. Holotype: IUQ 422. Par",
    "); ref. 30495] Rio Iguaçu, ca. 25°43'S, 49°44'W, Serrinha, Paraná, Brazil. Holotype: FMN",
    ". 4 (no. 2) (art. 7); ref. 1261] Corumba, río Otuquis, Ascunción, Paraguay. Holo",
    "); ref. 14367] Not: Puerto Mexico [Coatzacoalcos], Veracruz State, Isthmus of Tehuantepec, Mexico [probably Panama]. Holotype",
    "hia v. 7; ref. 168] Upper tributaries of Nueces River, Texas, U.S.A. Lectotype: BMNH 1883.12.14.107. Paralectotype"
  )
)


places %>% 
  mutate(
    type_locality = stringr::str_extract(text, "(?<=ref.\\s[:digit:]{1,6}\\]\\s).*?[^\\s(km|ca|U\\.S\\.)\\.](?=\\.\\s)")) %>% 
  relocate(type_locality)
#> # A tibble: 8 x 2
#>   type_locality                                                            text 
#>   <chr>                                                                    <chr>
#> 1 "Villavicencio, Río Orinoco system, Colombia. Holotype: FMNH 56646 [ex ~ "ver~
#> 2 "Villavicencio, Río Orinoco system, Colombie"                            "ver~
#> 3 "Roadside pool, 3 kilometers south of Progreso, Yucatán, southeastern M~ "ion~
#> 4 "La Sapayera stream, near the bridge km. 2 vía Guadalupe-Florencia, 1°5~ "for~
#> 5 "Rio Iguaçu, ca. 25°43'S, 49°44'W, Serrinha, Paraná, Brazil"             "); ~
#> 6 "Corumba, río Otuquis, Ascunción, Paraguay"                              ". 4~
#> 7 "Not: Puerto Mexico [Coatzacoalcos], Veracruz State, Isthmus of Tehuant~ "); ~
#> 8 "Upper tributaries of Nueces River, Texas, U.S.A"                        "hia~

Created on 2022-12-19 with reprex v2.0.2

CodePudding user response:

You can use

stringr::str_extract(text, "(?<=ref\\.\\s[0-9]{1,6}]\\s)(?:\\b(?:km|ca|U\\.S)\\.|.)*?(?=\\.\\s)")
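
As a rough sketch of how this could be dropped into the pipeline from the question, the block below applies the pattern to three of the original strings (rows 1, 5 and 8). The names locality_pattern and result are only illustrative and are not part of the original code.

```r
library(dplyr)
library(stringr)

# Pattern from the answer, stored under a hypothetical name for reuse.
locality_pattern <- "(?<=ref\\.\\s[0-9]{1,6}]\\s)(?:\\b(?:km|ca|U\\.S)\\.|.)*?(?=\\.\\s)"

# Three of the strings from the question (rows 1, 5 and 8).
places <- tibble(
  text = c(
    "versity Studies No. 19; ref. 1259] Villavicencio, Río Orinoco system, Colombia. Holotype: FMNH 56646 [ex CM 5463]. Paratypes: (6) ",
    "); ref. 30495] Rio Iguaçu, ca. 25°43'S, 49°44'W, Serrinha, Paraná, Brazil. Holotype: FMN",
    "hia v. 7; ref. 168] Upper tributaries of Nueces River, Texas, U.S.A. Lectotype: BMNH 1883.12.14.107. Paralectotype"
  )
)

result <- places %>%
  mutate(type_locality = str_extract(text, locality_pattern)) %>%
  relocate(type_locality)

# The extracted localities should now end at Colombia, Brazil and U.S.A
# (without the final period), with "ca." kept inside the second one.
result$type_locality
```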

Details:

(?<=ref\.\s[0-9]{1,6}]\s) - a positive lookbehind that matches a location immediately preceded by ref., a whitespace, one to six digits, a ] char and a whitespace
(?:\b(?:km|ca|U\.S)\.|.)*? - zero or more repetitions, as few as possible, of either km, ca or U.S (starting at a word boundary) followed by a . char, or any single char other than a line break char
(?=\.\s) - a positive lookahead that requires a . char followed by a whitespace immediately to the right of the current location.
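
For comparison, the bracketed part of the pattern in the question, [^\s(km|ca|U\.S\.)\.], is a negated character class rather than a group. Inside the brackets the parentheses and | are literal characters, so the class appears to mean "any single character other than whitespace, (, k, m, |, c, a, U, ., S or )". That would explain why a locality ending in a. (Colombia.) cannot stop at that period while one ending in e. (Colombie.) can. A minimal sketch of this, testing the class on single characters (original_class and allowed are illustrative names):

```r
library(stringr)

# The bracket expression from the question's pattern, on its own.
# It matches exactly one character that is NOT listed inside the brackets.
original_class <- "[^\\s(km|ca|U\\.S\\.)\\.]"

# Last letters of Colombia, Colombie, Brazil and Paraguay, plus "k" from "km".
letters_to_test <- c("a", "e", "l", "y", "k")

# TRUE where the single character is accepted by the class.
allowed <- str_detect(letters_to_test, original_class)
names(allowed) <- letters_to_test

# "a" and "k" are rejected because they are listed in the class;
# "e", "l" and "y" are accepted.
allowed
```

Replacing the class with the alternation (?:\b(?:km|ca|U\.S)\.|.) avoids this, because km, ca and U.S are then matched as sequences rather than as a set of excluded letters.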