I have scraped a list of titles, some of which have subtitles. Unfortunately, whenever there is a subtitle, it is pasted directly onto the title (as if by paste0()). How can I separate the two in R? I am thinking of a regex, since a CamelCase-like pattern (a lower-case letter immediately followed by an upper-case letter) marks where the subtitle starts, like this:
data <- data.frame(title = "Bilder aus dem LebenWie man Universalerbe wird")
result <- data.frame(title = "Bilder aus dem Leben",
subtitle = "Wie man Universalerbe wird")
CodePudding user response:
A naive regex can look for a lower-case letter immediately followed by an upper-case letter:
strcapture("^(.+[a-z])([A-Z].+)", data$title, proto = list(title = "", subtitle = ""))
# title subtitle
# 1 Bilder aus dem Leben Wie man Universalerbe wird
CodePudding user response:
With tidyr's (new) separate_wider_regex():
library(tidyr)
separate_wider_regex(data, title, c(title = "^.+[a-z]", subtitle = "[A-Z].+"))
# title subtitle
#1 Bilder aus dem Leben Wie man Universalerbe wird
This is equivalent to the superseded extract():
extract(data, title, c("title", "subtitle"), "^(.+[a-z])([A-Z].+)")
CodePudding user response:
You can use separate() from tidyr:
library(tidyverse)
data %>%
separate(title, into = c("title", "subtitle"), sep = "(?<=[a-z])(?=[A-Z])")
# title subtitle
# 1 Bilder aus dem Leben Wie man Universalerbe wird
sep here uses two look-arounds to define the split point:
(?<=[a-z]): a positive look-behind asserting that there is a lower-case letter immediately to the left of the split point, and
(?=[A-Z]): a positive look-ahead asserting that there is an upper-case letter immediately to the right of the split point.
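The same look-behind idea can also be sketched in base R, without tidyr. This is only a minimal sketch: split_title is a hypothetical helper name, not part of any package, and it assumes that the first lower-case/upper-case boundary marks the start of the subtitle. A title with no such boundary is kept whole and gets an NA subtitle. Note that [a-z] and [A-Z] only cover ASCII letters, so titles with umlauts or names like "McDonald" may need a different pattern (for example the Unicode classes \\p{Ll} and \\p{Lu} with perl = TRUE).

```r
# Hypothetical helper (not from any package): split each string at the first
# upper-case letter that is directly preceded by a lower-case letter.
split_title <- function(x) {
  # regexpr() gives the position of that upper-case letter, or -1 if there
  # is no such boundary; as.integer() drops the match attributes.
  pos <- as.integer(regexpr("(?<=[a-z])[A-Z]", x, perl = TRUE))
  has_sub <- pos > 0
  data.frame(
    # Everything before the boundary, or the whole string if none was found
    title = ifelse(has_sub, substr(x, 1, pos - 1), x),
    # Everything from the boundary onwards, or NA if none was found
    subtitle = ifelse(has_sub, substr(x, pos, nchar(x)), NA_character_),
    stringsAsFactors = FALSE
  )
}

data <- data.frame(title = "Bilder aus dem LebenWie man Universalerbe wird")
split_title(data$title)
```

Using regexpr() rather than strsplit() is a deliberate choice here: strsplit() treats zero-length matches such as pure look-arounds in a special way, so its output for this kind of pattern can be surprising.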