How to extract specific substrings from a string with repeating anchor points in R using str_extract

Time:09-11

Context

I have a character vector a.

a = 'TITLE     Distribution and characteristics of Beilong virus among wild
            rodents and shrews in China
     JOURNAL   Infect. Genet. Evol. 85, 104454 (2020)
       messy text
       messy text
       messy text
       messy text
     TITLE     Direct Submission
     JOURNAL   Submitted (24-FEB-2020) State Key Laboratory of Pathogen and
               Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20
               Dong-Da Street, Fengtai District, Beijing 100071, China
       messy text
       messy text'

Question

I want to extract TITLE_1, JOURNAL_1, TITLE_2 and JOURNAL_2. The result should look like this:

TITLE_1 =  'Distribution and characteristics of Beilong virus among wild
            rodents and shrews in China'

JOURNAL_1 = 'Infect. Genet. Evol. 85, 104454 (2020)'

TITLE_2 = 'Direct Submission'

JOURNAL_2 = 'Submitted (24-FEB-2020) State Key Laboratory of Pathogen and
               Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20
               Dong-Da Street, Fengtai District, Beijing 100071, China'

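It's easy to extract JOURNAL_1, because str_extract returns the first match and . does not cross line breaks by default. The same pattern with TITLE only returns the first line of TITLE_1:

JOURNAL_1 = str_extract(a, '(?<=JOURNAL   ).*')
[1] "Infect. Genet. Evol. 85, 104454 (2020)"

TITLE_1 = str_extract(a, '(?<=TITLE     ).*')
[1] "Distribution and characteristics of Beilong virus among wild"

But I don't know how to extract the complete, multi-line TITLE_1 and JOURNAL_2, or the second occurrence TITLE_2.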

What I've done

# All failed
JOURNAL_2 = str_extract(a, regex('(?<=JOURNAL   ).*(?=messy text)', dotall = T))
[1] "Infect. Genet. Evol. 85, 104454 (2020)\n       messy text\n       messy text\n       messy text\n       messy text\n     TITLE     Direct Submission\n     JOURNAL   Submitted (24-FEB-2020) State Key Laboratory of Pathogen and\n               Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20\n               Dong-Da Street, Fengtai District, Beijing 100071, China\n       messy text\n       "

TITLE_1 = str_extract(a, regex('(?<=TITLE     ).*(?=JOURNAL)', dotall = T))
[1] "Distribution and characteristics of Beilong virus among wild\n            rodents and shrews in China\n     JOURNAL   Infect. Genet. Evol. 85, 104454 (2020)\n       messy text\n       messy text\n       messy text\n       messy text\n     TITLE     Direct Submission\n     "

CodePudding user response:

Here's a tidy approach:

library(tidyverse)
data.frame(a) %>%
  # clean up string:
  mutate(a = gsub("\\s+", " ", a)) %>%
  # separate elements of interest into rows:
  separate_rows(a, sep = "\\s(?=[A-Z]{2,})") %>%
  # split string:
  mutate(a = str_split(a, "(?<=[A-Z]{2,10})\\s", simplify = TRUE))
# A tibble: 4 × 1
  a[,1]   [,2]                                                                                    
  <chr>   <chr>                                                                                   
1 TITLE   Distribution and characteristics of Beilong virus among wild rodents and shrews in China
2 JOURNAL Infect. Genet. Evol. 85, 104454 (2020) messy text messy text messy text messy text      
3 TITLE   Direct Submission                                                                       
4 JOURNAL Submitted (24-FEB-2020) State Key Laboratory of Pathogen and Biosecurity, Beijing Insti…
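
A lighter alternative that stays close to the str_extract attempts in the question might look like the sketch below. Those attempts fail because .* is greedy and runs on to the last place where the lookahead matches; the lazy .*? stops at the first one. This is only a sketch under assumptions: the names titles and journals are illustrative, and "messy text" is treated as a literal end marker, which real data would need to replace with whatever actually follows a JOURNAL block.

```r
library(stringr)

# The vector from the question
a <- 'TITLE     Distribution and characteristics of Beilong virus among wild
            rodents and shrews in China
     JOURNAL   Infect. Genet. Evol. 85, 104454 (2020)
       messy text
       messy text
     TITLE     Direct Submission
     JOURNAL   Submitted (24-FEB-2020) State Key Laboratory of Pathogen and
               Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20
               Dong-Da Street, Fengtai District, Beijing 100071, China
       messy text
       messy text'

# \\S forces the match to start at the first non-space character after the
# anchor; .*? (lazy) ends it at the first position where the lookahead matches.
titles <- str_extract_all(
  a, regex("(?<=TITLE\\s{1,10})\\S.*?(?=\\s+JOURNAL)", dotall = TRUE)
)[[1]]

# Assumption: every JOURNAL block is followed by a line of "messy text".
journals <- str_extract_all(
  a, regex("(?<=JOURNAL\\s{1,10})\\S.*?(?=\\s+messy text)", dotall = TRUE)
)[[1]]

# Collapse line breaks and indentation into single spaces
titles   <- str_squish(titles)
journals <- str_squish(journals)

titles[2]
#> [1] "Direct Submission"
journals[1]
#> [1] "Infect. Genet. Evol. 85, 104454 (2020)"
```

str_extract_all returns every match in order, so titles[1] and titles[2] correspond to TITLE_1 and TITLE_2, and likewise for journals.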

CodePudding user response:

This solution is a bit dirty (it hard-codes the position of each piece), but it works well for this input.

library(stringr)
a = 'TITLE     Distribution and characteristics of Beilong virus among wild
            rodents and shrews in China
     JOURNAL   Infect. Genet. Evol. 85, 104454 (2020)
       messy text
       messy text
       messy text
       messy text
     TITLE     Direct Submission
     JOURNAL   Submitted (24-FEB-2020) State Key Laboratory of Pathogen and
               Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20
               Dong-Da Street, Fengtai District, Beijing 100071, China
       messy text
       messy text'

# Split on the anchors; A[1] is the empty string before the first TITLE
A <-
a |>
  str_split("TITLE\\s+|JOURNAL\\s+") |>
  unlist()

# Join the wrapped line, then drop the trailing line break
TITLE_1 <-
A[2] |>
  str_replace("\\n\\s*", " ") |>
  str_replace("\\n", "")

# Keep only the first line
JOURNAL_1 <-
A[3] |> str_extract("(^.+)\\n") |>
  str_replace("\\n\\s*", "")

TITLE_2 <-
A[4] |>
  str_replace("\\n\\s*", "")

# Join the two wrapped lines, then cut everything after "China"
JOURNAL_2 <-
A[5] |>
  str_replace("\\n\\s*", " ") |>
  str_replace("\\n\\s*", " ") |>
  str_extract("^.+China")

TITLE_1
#> [1] "Distribution and characteristics of Beilong virus among wild rodents and shrews in China     "
JOURNAL_1
#> [1] "Infect. Genet. Evol. 85, 104454 (2020)"
TITLE_2
#> [1] "Direct Submission"
JOURNAL_2
#> [1] "Submitted (24-FEB-2020) State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20 Dong-Da Street, Fengtai District, Beijing 100071, China"

Created on 2022-09-10 with reprex v2.0.2
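
If the number of TITLE/JOURNAL pairs is not known in advance, the hard-coded indices can be avoided. The following is a rough sketch, not a drop-in replacement: it assumes that continuation lines are indented by at least 10 spaces while keyword lines and unrelated lines are indented less (true for the example, but an assumption about real data), and the names m, keys, values, idx and result are made up for illustration.

```r
library(stringr)

a <- 'TITLE     Distribution and characteristics of Beilong virus among wild
            rodents and shrews in China
     JOURNAL   Infect. Genet. Evol. 85, 104454 (2020)
       messy text
       messy text
     TITLE     Direct Submission
     JOURNAL   Submitted (24-FEB-2020) State Key Laboratory of Pathogen and
               Biosecurity, Beijing Institute of Microbiology and Epidemiology, 20
               Dong-Da Street, Fengtai District, Beijing 100071, China
       messy text
       messy text'

# Group 1: the keyword. Group 2: the rest of its line plus any following
# lines that start with 10 or more spaces (assumed to be continuation lines).
m <- str_match_all(a, "(TITLE|JOURNAL)\\s+(.*(?:\\n {10,}.*)*)")[[1]]

keys   <- m[, 2]
values <- str_squish(m[, 3])  # collapse line breaks and indentation

# Number each keyword by order of appearance: TITLE_1, JOURNAL_1, TITLE_2, ...
idx    <- ave(seq_along(keys), keys, FUN = seq_along)
result <- setNames(values, paste0(keys, "_", idx))

result[["TITLE_2"]]
#> [1] "Direct Submission"
```

result is a named character vector, so each piece can be looked up by name, for example result[["JOURNAL_2"]].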
