I am trying to extract some phrases (localities) from some strings, and have put several examples below. The function searches the text after the first ref. <number>] and should stop at the first period followed by a space (. ), while ignoring abbreviations like km., ca. and U.S. (the latter for cases such as U.S.A.). However, the ca within the pattern in the function ends up excluding the a. at the end of Colombia. in the first example, so the match runs past it. If I test with Colombie instead, it works as expected. All the other examples work; only the first one does not.
library(dplyr)
library(stringr)
places <- tibble(
  text = c(
    "versity Studies No. 19; ref. 1259] Villavicencio, Río Orinoco system, Colombia. Holotype: FMNH 56646 [ex CM 5463]. Paratypes: (6) ",
    "versity Studies No. 19; ref. 1259] Villavicencio, Río Orinoco system, Colombie. Holotype: FMNH 56646 [ex CM 5463]. Paratypes: (6) ",
    "ions No. 457 (art. 17); ref. 2247] Roadside pool, 3 kilometers south of Progreso, Yucatán, southeastern Mexico. Holotype: UMMZ 102144. Paratypes: UMMZ 102145 (3). •Syn",
    "for 2001); ref. 27036] La Sapayera stream, near the bridge km. 2 vía Guadalupe-Florencia, 1°59'27\"N, 75°45'08\"W, afluent of La Viciosa stream, río Suaza system, upper Magdalena, Cachimba trail, Municipality of Guadalupe, Department of Huila, Colombia, elevation 952 meters. Holotype: IUQ 422. Par",
    "); ref. 30495] Rio Iguaçu, ca. 25°43'S, 49°44'W, Serrinha, Paraná, Brazil. Holotype: FMN",
    ". 4 (no. 2) (art. 7); ref. 1261] Corumba, río Otuquis, Ascunción, Paraguay. Holo",
    "); ref. 14367] Not: Puerto Mexico [Coatzacoalcos], Veracruz State, Isthmus of Tehuantepec, Mexico [probably Panama]. Holotype",
    "hia v. 7; ref. 168] Upper tributaries of Nueces River, Texas, U.S.A. Lectotype: BMNH 1883.12.14.107. Paralectotype"
  )
)
places %>%
  mutate(
    type_locality = stringr::str_extract(text, "(?<=ref.\\s[:digit:]{1,6}\\]\\s).*?[^\\s(km|ca|U\\.S\\.)\\.](?=\\.\\s)")
  ) %>%
  relocate(type_locality)
#> # A tibble: 8 x 2
#> type_locality text
#> <chr> <chr>
#> 1 "Villavicencio, Río Orinoco system, Colombia. Holotype: FMNH 56646 [ex ~ "ver~
#> 2 "Villavicencio, Río Orinoco system, Colombie" "ver~
#> 3 "Roadside pool, 3 kilometers south of Progreso, Yucatán, southeastern M~ "ion~
#> 4 "La Sapayera stream, near the bridge km. 2 vía Guadalupe-Florencia, 1°5~ "for~
#> 5 "Rio Iguaçu, ca. 25°43'S, 49°44'W, Serrinha, Paraná, Brazil" "); ~
#> 6 "Corumba, río Otuquis, Ascunción, Paraguay" ". 4~
#> 7 "Not: Puerto Mexico [Coatzacoalcos], Veracruz State, Isthmus of Tehuant~ "); ~
#> 8 "Upper tributaries of Nueces River, Texas, U.S.A" "hia~
Created on 2022-12-19 with reprex v2.0.2
Answer:
You can use
stringr::str_extract(text, "(?<=ref\\.\\s[0-9]{1,6}]\\s)(?:\\b(?:km|ca|U\\.S)\\.|.)*?(?=\\.\\s)")
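As a quick check, here is a minimal sketch that applies this pattern to three strings. The strings are shortened, ASCII-only variants of the samples in the question (an assumption made for illustration, not the original data), and the names pattern, text and type_locality are placeholders.

```r
library(stringr)

# Pattern from the answer: "km.", "ca." and "U.S." are consumed as whole
# tokens; otherwise one character at a time is consumed until ". " follows.
pattern <- "(?<=ref\\.\\s[0-9]{1,6}]\\s)(?:\\b(?:km|ca|U\\.S)\\.|.)*?(?=\\.\\s)"

# Simplified variants of sample strings from the question (illustrative only)
text <- c(
  "ref. 1259] Villavicencio, Colombia. Holotype: FMNH 56646 [ex CM 5463]. Paratypes: (6) ",
  "ref. 27036] La Sapayera stream, near the bridge km. 2, Department of Huila, Colombia, elevation 952 meters. Holotype: IUQ 422. Par",
  "ref. 168] Upper tributaries of Nueces River, Texas, U.S.A. Lectotype: BMNH 1883.12.14.107. Paralectotype"
)

type_locality <- str_extract(text, pattern)
print(type_locality)
```

With these inputs the first match should end at Colombia, the second should run through km. 2 up to meters, and the third should end at U.S.A.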
Details:
(?<=ref\.\s[0-9]{1,6}]\s) - a positive lookbehind that matches a location immediately preceded by ref., a whitespace, one to six digits, a ] char, and a whitespace
(?:\b(?:km|ca|U\.S)\.|.)*? - zero or more, but as few as possible, occurrences of either the whole words km, ca, or U.S followed by a . char, or any char other than a line break char
(?=\.\s) - a positive lookahead that requires a . char and a whitespace immediately to the right of the current location.
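For context on why the pattern in the question skips Colombia. but accepts Colombie.: the part [^\\s(km|ca|U\\.S\\.)\\.] is a negated character class, so the parentheses and | inside it are literal characters rather than a group with alternatives. It therefore rejects any single character from that set (including k, m, c and a) directly before the final period. A small sketch of that behavior, where last_char and allowed are illustrative names:

```r
library(stringr)

# The bracketed part of the pattern from the question. Inside [^...] the
# parentheses and "|" are literal, so this is a set of single characters.
last_char <- "[^\\s(km|ca|U\\.S\\.)\\.]"

# Last letters of "Colombia", "Colombie" and "Brazil": is each one allowed
# to sit right before the closing ". "?
allowed <- str_detect(c("a", "e", "l"), last_char)
print(allowed)
```

Here a is rejected while e and l are accepted, which matches the output shown in the question. The pattern in this answer avoids the problem by matching the abbreviations as whole tokens instead of excluding individual letters.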