How to extract substring between periods in R

Time:12-06

I need to create a dataframe from a .csv file containing author references:

refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")

Essentially I want to pull out the coauthors, year of publication, and article title.

refs$author[1]

Harris P R, Harris D L

refs$year[1]
1983

refs$title[1]
Training for the Metaindustrial Work Culture

At this stage, I do not need a publication source as I can get this via rscopus.

I can extract authors and years with this code:

library(dplyr)
library(stringr)

refs <- refs %>%
  mutate(author = sub("\\(.*", "", reference),
         year = str_extract(reference, "\\d{4}"))

However, I need help extracting the title (the substring between the two periods that follow the bracketed date).

CodePudding user response:

This regex works for your minimal example:

refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")
sub("[^.]+\\.([^.]+)\\..*", "\\1", refs$reference)
#> [1] " Training for the Metaindustrial Work Culture"

Explanation:

"[^.]+\\.([^.]+)\\..*" - whole regex

[^.]+\\. - one or more characters that aren't a period, followed by a period (i.e. everything up to and including the first period)

([^.]+)\\..* - open capture group 1 with "(", match one or more characters that aren't a period ([^.]+), and close the group with ")"; the next period "\\." ends the title (group 1 now holds the title), and ".*" matches everything after it

Then, in the sub call, the whole match is replaced with group 1 ("\\1"), so only the title is returned.
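
Note that the captured title keeps the space that follows the first period. As a minimal sketch (the variable names title_trimmed and title_regex are only illustrative), either strip it afterwards with trimws() or consume it with \\s* before the capture group:

```r
reference <- "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22."

# Option 1: remove leading/trailing whitespace after the substitution
title_trimmed <- trimws(sub("[^.]+\\.([^.]+)\\..*", "\\1", reference))

# Option 2: match the whitespace outside the capture group instead
title_regex <- sub("[^.]+\\.\\s*([^.]+)\\..*", "\\1", reference)

print(title_trimmed)
print(title_regex)
```

Both options give the same result for this reference; the second keeps everything in one pattern.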

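Unfortunately, you may run into problems with your 'real world' data: the pattern assumes that the first period in the string is the one after the year and that the title itself contains no periods, so author initials written with periods (e.g. "Harris, P. R.") or a title with an abbreviation will break it. Using rscopus to extract the title might be a better solution to avoid unforeseen errors.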
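
If you do stay with a regex, one way to make it a little less fragile is to anchor on the bracketed year rather than on the first period. The following is only a sketch under that assumption: extract_title is a hypothetical helper (not part of any package), it still assumes the title contains no periods, and it returns NA when no "(YYYY)." is found instead of silently returning the whole reference as sub() would.

```r
# Hypothetical helper: title = text between "(YYYY)." and the next period
extract_title <- function(reference) {
  reference <- as.character(reference)
  # "(YYYY)" followed by a period, optional whitespace, then the title
  pattern <- "\\(\\d{4}\\)\\.\\s*([^.]+)\\."
  matches <- regmatches(reference, regexec(pattern, reference, perl = TRUE))
  # Element 1 is the full match, element 2 is capture group 1 (the title)
  vapply(matches,
         function(m) if (length(m) >= 2) m[2] else NA_character_,
         character(1))
}

examples <- c(
  "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.",
  "Harris, P. R. (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.",
  "A reference without a bracketed year."
)
titles <- extract_title(examples)
print(titles)
```

Because the function is vectorised, it can be used directly inside mutate(), e.g. title = extract_title(reference).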


Using tidyverse functions:

library(tidyverse)

refs <- data.frame(reference = "Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.")

refs %>%
  mutate(author = sub("\\(.*", "", reference),
         year = str_extract(reference, "\\d{4}"),
         title = sub("[^.]+\\.([^.]+)\\..*", "\\1", reference))
#>                                                                                                                         reference
#> 1 Harris P R, Harris D L (1983). Training for the Metaindustrial Work Culture. Journal of European Industrial Training, 7(7): 22.
#>                    author year                                         title
#> 1 Harris P R, Harris D L  1983  Training for the Metaindustrial Work Culture

Created on 2022-12-05 with reprex v2.0.2
