Context
I have a character vector a.
a = 'TITLE Distribution and characteristics of Beilong virus among wild
rodents and shrews in China
JOURNAL Infect. Genet. Evol. 85, 104454 (2020)
messy text
messy text
messy text
messy text
TITLE Direct Submission
JOURNAL Submitted (24-FEB-2020) State Key Laboratory of Pathogen and
Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20
Dong-Da Street, Fengtai District, Beijing 100071, China
messy text
messy text'
Question
I want to extract TITLE_1, JOURNAL_1, TITLE_2 and JOURNAL_2. The result should look like this:
TITLE_1 = 'Distribution and characteristics of Beilong virus among wild
rodents and shrews in China'
JOURNAL_1 = 'Infect. Genet. Evol. 85, 104454 (2020)'
TITLE_2 = 'Direct Submission'
JOURNAL_2 = 'Submitted (24-FEB-2020) State Key Laboratory of Pathogen and
Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20
Dong-Da Street, Fengtai District, Beijing 100071, China'
It's easy to extract JOURNAL_1 and TITLE_2.
JOURNAL_1 = str_extract(a, '(?<=JOURNAL ).*')
[1] "Infect. Genet. Evol. 85, 104454 (2020)"
TITLE_2 = str_extract(a, '(?<=TITLE ).*')
[1] "Distribution and characteristics of Beilong virus among wild"
But I don't know how to extract JOURNAL_2 and TITLE_1.
What I've done
# All failed
JOURNAL_2 = str_extract(a, regex('(?<=JOURNAL ).*(?=messy text)', dotall = T))
[1] "Infect. Genet. Evol. 85, 104454 (2020)\n messy text\n messy text\n messy text\n messy text\n TITLE Direct Submission\n JOURNAL Submitted (24-FEB-2020) State Key Laboratory of Pathogen and\n Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20\n Dong-Da Street, Fengtai District, Beijing 100071, China\n messy text\n "
TITLE_1 = str_extract(a, regex('(?<=TITLE ).*(?=JOURNAL)', dotall = T))
[1] "Distribution and characteristics of Beilong virus among wild\n rodents and shrews in China\n JOURNAL Infect. Genet. Evol. 85, 104454 (2020)\n messy text\n messy text\n messy text\n messy text\n TITLE Direct Submission\n "
CodePudding user response:
Here's a tidy approach:
library(tidyverse)
data.frame(a) %>%
# clean up string:
mutate(a = gsub("\\s+", " ", a)) %>%
# separate elements of interest into rows:
separate_rows(a, sep = "\\s(?=[A-Z]{2,})") %>%
# split string:
mutate(a = str_split(a, "(?<=[A-Z]{2,10})\\s", simplify = TRUE))
# A tibble: 4 × 1
a[,1] [,2]
<chr> <chr>
1 TITLE Distribution and characteristics of Beilong virus among wild rodents and shrews in China
2 JOURNAL Infect. Genet. Evol. 85, 104454 (2020) messy text messy text messy text messy text
3 TITLE Direct Submission
4 JOURNAL Submitted (24-FEB-2020) State Key Laboratory of Pathogen and Biosecurity, Beijing Insti…
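Note that row 2 above still carries the trailing "messy text". A minimal sketch of one way to trim it, assuming the unwanted tail is literally the placeholder "messy text" from the question (real data would need its own pattern); the vector x is a hypothetical stand-in for column 2 of the tibble:

```r
library(stringr)

# Two values as they appear in column 2 of the tibble above
# (x is an illustrative stand-in, not part of the original answer).
x <- c(
  "Infect. Genet. Evol. 85, 104454 (2020) messy text messy text messy text messy text",
  "Direct Submission"
)

# Drop one or more trailing "messy text" chunks; strings without them are unchanged.
cleaned <- str_remove(x, "(\\s*messy text)+$")
cleaned
```

The same str_remove() call could be applied inside a further mutate() step on the second column.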
CodePudding user response:
This solution is a bit dirty, but it works well.
library(stringr)
a = 'TITLE Distribution and characteristics of Beilong virus among wild
rodents and shrews in China
JOURNAL Infect. Genet. Evol. 85, 104454 (2020)
messy text
messy text
messy text
messy text
TITLE Direct Submission
JOURNAL Submitted (24-FEB-2020) State Key Laboratory of Pathogen and
Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20
Dong-Da Street, Fengtai District, Beijing 100071, China
messy text
messy text'
# Split on the keywords; A[1] is the empty string before the first TITLE
A <-
  a |>
  str_split("TITLE\\s+|JOURNAL\\s+") |>
  unlist()
# Join the two title lines, then drop the trailing newline
TITLE_1 <-
  A[2] |>
  str_replace("\\n\\s*", " ") |>
  str_replace("\\n", "")
# Keep only the first line of the chunk
JOURNAL_1 <-
  A[3] |> str_extract("(^.+)\\n") |>
  str_replace("\\n\\s*", "")
TITLE_2 <-
  A[4] |>
  str_replace("\\n\\s*", "")
# Join the three journal lines, then cut everything after "China"
JOURNAL_2 <-
  A[5] |>
  str_replace("\\n\\s*", " ") |>
  str_replace("\\n\\s*", " ") |>
  str_extract("^.+China")
TITLE_1
#> [1] "Distribution and characteristics of Beilong virus among wild rodents and shrews in China "
JOURNAL_1
#> [1] "Infect. Genet. Evol. 85, 104454 (2020)"
TITLE_2
#> [1] "Direct Submission"
JOURNAL_2
#> [1] "Submitted (24-FEB-2020) State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20 Dong-Da Street, Fengtai District, Beijing 100071, China"
Created on 2022-09-10 with reprex v2.0.2
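The attempts in the question over-match because .* is greedy: with dotall = TRUE it runs to the last possible "messy text" or "JOURNAL" rather than the first. A minimal sketch of a lazy-quantifier alternative follows. It assumes that every field ends at the next TITLE or JOURNAL keyword, at a line starting with the question's placeholder "messy text", or at the end of the string; the names pattern, m, key, idx and fields are illustrative.

```r
library(stringr)

# Sample input rebuilt from the question; "messy text" is the question's
# placeholder for unrelated lines, not a real keyword.
a <- paste(
  "TITLE Distribution and characteristics of Beilong virus among wild",
  " rodents and shrews in China",
  "JOURNAL Infect. Genet. Evol. 85, 104454 (2020)",
  "messy text",
  "messy text",
  "messy text",
  "messy text",
  "TITLE Direct Submission",
  "JOURNAL Submitted (24-FEB-2020) State Key Laboratory of Pathogen and",
  " Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20",
  " Dong-Da Street, Fengtai District, Beijing 100071, China",
  "messy text",
  "messy text",
  sep = "\n"
)

# Keyword, then a lazy body (.*?) that stops before the next keyword,
# the next "messy text" line, or the end of the string.
pattern <- regex(
  "(TITLE|JOURNAL)\\s+(.*?)(?=\\s*\\n\\s*(?:messy text|TITLE|JOURNAL)|\\s*$)",
  dotall = TRUE
)

m <- str_match_all(a, pattern)[[1]]  # columns: whole match, keyword, body
key <- m[, 2]
# Number each keyword by order of appearance: TITLE_1, JOURNAL_1, TITLE_2, ...
idx <- ave(seq_along(key), key, FUN = seq_along)
# str_squish() collapses the line breaks and indentation inside each body
fields <- setNames(str_squish(m[, 3]), paste0(key, "_", idx))

fields[["TITLE_1"]]
fields[["JOURNAL_2"]]
```

Because the result is a named vector, any field can be pulled out by name, and the same pattern keeps working if the text contains more than two TITLE/JOURNAL pairs.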