I have a data.frame that contains a column named movies_name. This column contains data in this format: City of Lost Children, The (Cité des enfants perdus, La) (1995). I want to separate the year from the rest of the movie name without losing the text inside the brackets. To be more precise, I want to create one new column holding the year and another holding the movie name alone.
I tried this approach but now I cannot gather back the movie name:
thanks
CodePudding user response:
This looks for a number in round brackets at the end of the string, using stringr.
data.frame(movies, year = stringr::str_match(movies$movie, "\\((\\d+)\\)$")[,2])
movie year
1 City of Lost Children, The (Cité des enfants (2002) perdus, La) (1995) 1995
2 City of Lost Children, The (Cité des enfants perdus, La) (1995) 1995
Data
movies <- structure(list(movie = c("City of Lost Children, The (Cité des enfants (2002) perdus, La) (1995)",
"City of Lost Children, The (Cité des enfants perdus, La) (1995)"
)), row.names = c(NA, -2L), class = "data.frame")
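The question also asks for the title on its own. A minimal base R sketch of one way to get both columns, assuming the year is always the last parenthesised number in the string; the column names title and year and the variable year_at_end are illustrative, not part of the original answer:

```r
# Same two rows as the Data block; "\u00e9" is the letter e with an acute accent.
movies <- data.frame(
  movie = c("City of Lost Children, The (Cit\u00e9 des enfants (2002) perdus, La) (1995)",
            "City of Lost Children, The (Cit\u00e9 des enfants perdus, La) (1995)"),
  stringsAsFactors = FALSE
)

# Optional whitespace, then digits in round brackets, anchored at the end.
year_at_end <- "\\s*\\((\\d+)\\)$"

# Title: drop the trailing "(year)"; earlier brackets such as "(2002)" stay.
movies$title <- sub(year_at_end, "", movies$movie)

# Year: keep only the digits captured by the group.
movies$year <- as.integer(sub(paste0("^.*", year_at_end), "\\1", movies$movie))
```

Because the pattern is anchored with $, a bracketed number in the middle of the title is left alone.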
CodePudding user response:
Try the function extract from tidyr (part of the tidyverse):
library(tidyverse)
df %>%
  extract(movies_name,
          into = c("title", "year"),
          regex = "(\\D+)\\s\\((\\d+)\\)")
title year
1 City of Lost Children, The (Cité des enfants perdus, La) 1995
2 another film 2020
How the regex works:
(\\D+): first capture group, matching one or more characters that are not digits
\\s\\(: a whitespace character and an opening parenthesis (not captured)
(\\d+): second capture group, matching one or more digits
\\): closing parenthesis (not captured)
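To inspect what each capture group picks up without tidyr, the same pattern can be run through base R's regexec and regmatches. This is only a sketch; the names x, pattern, groups, parts and result are made up for illustration:

```r
# The two sample titles from Data 1; "\u00e9" is the letter e with an acute accent.
x <- c("City of Lost Children, The (Cit\u00e9 des enfants perdus, La) (1995)",
       "another film (2020)")

pattern <- "(\\D+)\\s\\((\\d+)\\)"

# For each string, regmatches() returns the full match followed by one
# element per capture group.
groups <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Stack the per-string results into a matrix: column 1 is the full match,
# column 2 the first group (title), column 3 the second group (year).
parts <- do.call(rbind, groups)

result <- data.frame(title = parts[, 2],
                     year = as.integer(parts[, 3]),
                     stringsAsFactors = FALSE)
```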
Data 1:
df <- data.frame(
movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
"another film (2020)")
)
EDIT:
Okay, following a comment, let's make this a little more complex by including a title that itself contains digits:
Data 2:
df <- data.frame(
movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
"another film (2020)",
"Under Siege 2: Dark Territory (1995)")
)
Solution - actually easier than the previous one ;)
df %>%
  extract(movies_name,
          into = c("title", "year"),
          regex = "(.+)\\s\\((\\d+)\\)")
title year
1 City of Lost Children, The (Cité des enfants perdus, La) 1995
2 another film 2020
3 Under Siege 2: Dark Territory 1995
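The difference between the two patterns can be checked in base R on the title that contains a digit. This is a hedged sketch using regexec rather than extract, so it only illustrates how the regexes themselves behave; the variable names are illustrative:

```r
x <- "Under Siege 2: Dark Territory (1995)"

# First pattern: \\D+ cannot cross the "2", so the match can only start after it.
no_digits <- regmatches(x, regexec("(\\D+)\\s\\((\\d+)\\)", x, perl = TRUE))[[1]]

# Second pattern: .+ is greedy and accepts digits, so it takes everything
# up to the last " (digits)".
any_chars <- regmatches(x, regexec("(.+)\\s\\((\\d+)\\)", x, perl = TRUE))[[1]]

no_digits[2]  # ": Dark Territory"
any_chars[2]  # "Under Siege 2: Dark Territory"
```

If stray matches in the middle of a string are a concern, anchoring the pattern as "^(.+)\\s\\((\\d+)\\)$" is one option to consider.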