Separating text in R

Time:01-14

I have a data.frame that contains a column named movies_name. The values in this column look like this: City of Lost Children, The (Cité des enfants perdus, La) (1995). I want to separate the year from the rest of the movie name without losing the text inside the other brackets. More precisely, I want to create one new column holding the year and another holding the movie name alone.

I tried this approach, but now I cannot recover the movie name:

My approach

thanks

CodePudding user response:

This looks for a number in round brackets at the end of the string, using stringr.

data.frame(movies, year = stringr::str_match(movies$movie, "\\((\\d+)\\)$")[,2])
                                                                   movie year
1 City of Lost Children, The (Cité des enfants (2002) perdus, La) (1995) 1995
2        City of Lost Children, The (Cité des enfants perdus, La) (1995) 1995

Data

movies <- structure(list(movie = c("City of Lost Children, The (Cité des enfants (2002) perdus, La) (1995)",
"City of Lost Children, The (Cité des enfants perdus, La) (1995)"
)), row.names = c(NA, -2L), class = "data.frame")
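
The question also asks for the movie name without the year. A minimal sketch of one way to get both columns with the same end-anchored idea, using base R's sub() instead of stringr (that swap, and the names year_suffix and title, are assumptions for illustration, not part of the answer above):

```r
# Same sample data as in the answer above
movies <- data.frame(
  movie = c(
    "City of Lost Children, The (Cité des enfants (2002) perdus, La) (1995)",
    "City of Lost Children, The (Cité des enfants perdus, La) (1995)"
  )
)

# Optional whitespace plus "(digits)", anchored at the end of the string
year_suffix <- "\\s*\\((\\d+)\\)$"

# Title: drop the trailing year; a "(2002)" in the middle is left untouched
movies$title <- sub(year_suffix, "", movies$movie)

# Year: keep only the digits captured from the trailing parentheses
movies$year <- as.integer(sub(paste0("^.*", year_suffix), "\\1", movies$movie))
```

A string without a trailing year would pass through sub() unchanged, so as.integer() would turn it into NA with a warning.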

CodePudding user response:

Try the function extract from tidyr (part of the tidyverse):

library(tidyverse)    
df %>%
  extract(movies_name,
          into = c("title", "year"), 
          regex = "(\\D+)\\s\\((\\d+)\\)")
                                                         title year
    1 City of Lost Children, The (Cité des enfants perdus, La) 1995
    2                                             another film 2020

How the regex works:

  • (\\D+): first capture group, matching one or more characters that are not digits
  • \\s\\(: a whitespace character and an opening parenthesis (not captured)
  • (\\d+): second capture group, matching one or more digits
  • \\): closing parenthesis (not captured)
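
To see the two capture groups in isolation, here is a small sketch that applies the same pattern to a single string with base R's regexec() and regmatches(). Using base R here instead of extract() is an assumption made only to expose the raw match; the variable names are illustrative.

```r
x <- "City of Lost Children, The (Cité des enfants perdus, La) (1995)"

# regexec() reports the full match followed by one entry per capture group
groups <- regmatches(x, regexec("(\\D+)\\s\\((\\d+)\\)", x))[[1]]

title <- groups[2]  # first capture group: the non-digit run before " (1995)"
year  <- groups[3]  # second capture group: the digits inside the parentheses
```

Here groups[1] holds the complete match, which is why the title and year sit at positions 2 and 3.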

Data 1:

df <- data.frame(
  movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
                  "another film (2020)")
)

EDIT:

Okay, following a comment, let's make this a little more complex by including a title that itself contains digits:

Data 2:

df <- data.frame(
  movies_name = c("City of Lost Children, The (Cité des enfants perdus, La) (1995)",
                  "another film (2020)",
                  "Under Siege 2: Dark Territory (1995)")
)

Solution - actually easier than the previous one ;)

df %>%
  extract(movies_name,
          into = c("title", "year"), 
          regex = "(.+)\\s\\((\\d+)\\)")
                                                     title year
1 City of Lost Children, The (Cité des enfants perdus, La) 1995
2                                             another film 2020
3                            Under Siege 2: Dark Territory 1995
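
Why the change from \\D+ to .+ matters can be checked with a short base R sketch. The helper capture_groups is a hypothetical name introduced here, and regexec() stands in for extract() as an assumption so the raw groups are visible:

```r
titles <- c("another film (2020)", "Under Siege 2: Dark Territory (1995)")

# Hypothetical helper: one row per string, columns = the two capture groups
capture_groups <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x))
  t(vapply(m, function(g) g[2:3], character(2)))
}

# \\D+ cannot cross the "2" in the title, so the match starts after it
non_digit <- capture_groups("(\\D+)\\s\\((\\d+)\\)", titles)

# .+ accepts any character, so the whole title is captured
any_char <- capture_groups("(.+)\\s\\((\\d+)\\)", titles)
```

For the "Under Siege 2" row, non_digit keeps only ": Dark Territory" as the title, while any_char keeps the full "Under Siege 2: Dark Territory".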