Separate title and subtitle in a string by CamelCase in R

I have scraped a list of titles, some of which have subtitles. Unfortunately, whenever there is a subtitle, it is pasted directly onto the title with no separator (as if by paste0()). How can I separate the two in R? I am thinking of a regex, since a CamelCase boundary (a lower-case letter immediately followed by an upper-case letter) marks where the subtitle starts, like this:

data <- data.frame(title = "Bilder aus dem LebenWie man Universalerbe wird")

result <- data.frame(title = "Bilder aus dem Leben",
                     subtitle = "Wie man Universalerbe wird")

CodePudding user response：

A naive regex can look for a lower-case letter immediately followed by an upper-case letter:

strcapture("^(.+[a-z])([A-Z].+)", data$title, proto = list(title = "", subtitle = ""))
#                  title                   subtitle
# 1 Bilder aus dem Leben Wie man Universalerbe wird
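
Since only some of the scraped titles have a subtitle, it may help to see what happens to the ones that do not. As far as I know, strcapture returns NA in every column for a string the pattern does not match, so the original title would be lost. A minimal sketch of one way to keep it, assuming a made-up second title without a subtitle (the `titles` vector and the `no_match` helper are only for illustration):

```r
# Hypothetical input: one title with a fused subtitle, one without.
titles <- c("Bilder aus dem LebenWie man Universalerbe wird",
            "Bilder aus dem Leben")

parts <- strcapture("^(.+[a-z])([A-Z].+)", titles,
                    proto = list(title = "", subtitle = ""))

# Rows where the pattern did not match are NA in every column;
# put the untouched title back and leave the subtitle as NA.
no_match <- is.na(parts$title)
parts$title[no_match] <- titles[no_match]

print(parts)
```

Note that .+ is greedy, so if a title contained several lower-case/upper-case boundaries this pattern would split at the last one.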

CodePudding user response：

With tidyr's (new) separate_wider_regex:

library(tidyr)
separate_wider_regex(data, title, c(title = "^.+[a-z]", subtitle = "[A-Z].+"))

#  title                subtitle                                
#1 Bilder aus dem Leben Wie man Universalerbe wird

This is equivalent to the superseded extract:

extract(data, title, c("title", "subtitle"), "^(.+[a-z])([A-Z].+)")

CodePudding user response：

You can use separate from tidyr:

library(tidyverse)
data %>%
  separate(title, into = c("title", "subtitle"), sep = "(?<=[a-z])(?=[A-Z])")
#                  title                   subtitle
# 1 Bilder aus dem Leben Wie man Universalerbe wird

sep here uses two look-arounds to define the split point:

(?<=[a-z]): a positive look-behind asserting that there is a lower-case letter immediately to the left of the split point, and
(?=[A-Z]): a positive look-ahead asserting that there is an upper-case letter immediately to the right of the split point.
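
Because both look-arounds only assert and consume nothing, the match itself has zero width, which is why no character is lost at the split. A rough base-R sketch of the same idea, locating the split point with regexpr (perl = TRUE is needed for look-arounds) and cutting the string there; the variable names are just for illustration, and this is not how separate works internally:

```r
x <- "Bilder aus dem LebenWie man Universalerbe wird"

# The match is zero-width; its position is the index of the first
# character to the right of the split point (here the "W").
pos <- regexpr("(?<=[a-z])(?=[A-Z])", x, perl = TRUE)
split_at <- as.integer(pos)

# match.length is 0 because the look-arounds consume no characters.
width <- attr(pos, "match.length")

# Cut the string on either side of the split point.
title    <- substr(x, 1, split_at - 1)
subtitle <- substr(x, split_at, nchar(x))

print(c(title = title, subtitle = subtitle))
```

regexpr returns -1 when there is no such boundary, so a real script would need to check for that before cutting.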