Parse text into table with R

Time:06-09

I have some txt files that were originally srt files, the format in which subtitles are usually published. They typically follow this pattern:

Subtitle_number
Beginning_min --> Ending_min
Text

As an example, this might be the structure of an srt file:

1
00:00:00,100 --> 00:00:01,500
This is the first subtitle

2
00:00:01,700 --> 00:00:02,300
of the movie

Now, I have some "modified" srt files, which differ from normal ones in that the name of the character appears right after the subtitle number. Here is an example:

1 Matt
00:00:00,100 --> 00:00:01,500
This is said by Matt

2 Lucas
00:00:01,700 --> 00:00:02,300
While this is said by Lucas

What I would like to do is to parse these files in order to create a data.frame like the following:

+--------------+------------------------------+
| CHARACTER    | TEXT                         |
+--------------+------------------------------+
| Matt         | This is said by Matt         |
+--------------+------------------------------+
| Lucas        | While this is said by Lucas  |
+--------------+------------------------------+

So I do not want the subtitle number or the timestamps. I have been able to read the text with the readtext package, which gives something like this:

1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas

Note that the texts themselves may also contain \n, as well as any other (readable) character.

This is where I am stuck. I guess I would have to use some kind of regex to extract all the names and then all the texts, but I have no idea how to do this.

Any help is highly appreciated!

CodePudding user response:

Here is a step-by-step way to do this without regex. It's a bit sloppy, but it's meant to show the logic of how to approach a file like this. The end result is a data frame from which you can grab the info you want.

txt <- "1 Matt
00:00:00,100 --> 00:00:01,500
This is said by Matt

2 Lucas
00:00:01,700 --> 00:00:02,300
While this is said by Lucas\nand another line

3
00:00:01,700 --> 00:00:02,300
While this is said by nobody"

library(readr)
library(tidyr)
library(tibble)
library(dplyr)
library(purrr)

df <- tibble(txt = read_lines(txt))

df %>% 
  rowid_to_column("row") %>% 
  group_by(group = cumsum(txt == "")) %>% 
  filter(!(txt == "")) %>% 
  mutate(field = pmin(row_number(), 3)) %>% 
  group_by(group, field) %>% 
  summarize(txt = paste(txt, collapse = "\n"), .groups = "drop") %>% 
  pivot_wider(names_from = "field",
              values_from = "txt") %>% 
  select(-group) %>% 
  set_names(c("Col1", "Col2", "Col3")) %>% 
  separate(Col1, c("Col1A", "Col1B"), extra = "merge", fill = "right")

And you get this data frame. You can name things whatever you want, of course.

# A tibble: 3 x 4
  Col1A Col1B Col2                          Col3                                           
  <chr> <chr> <chr>                         <chr>                                          
1 1     Matt  00:00:00,100 --> 00:00:01,500 "This is said by Matt"                         
2 2     Lucas 00:00:01,700 --> 00:00:02,300 "While this is said by Lucas\nand another line"
3 3     NA    00:00:01,700 --> 00:00:02,300 "While this is said by nobody"

EDIT

Here is a more streamlined way using a bit of tidyverse.

library(tidyr)
library(dplyr)

tibble(txt = txt) %>% 
  separate_rows(txt, sep = "\\n\\n") %>% 
  separate(txt, c("A", "B", "C"), sep = "\n", extra = "merge") %>% 
  separate(A, c("A1", "A2"), extra = "merge", fill = "right")
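
If you only want the two columns from the question, the same split-by-blank-line logic can also be sketched in base R. This is just one possible way to write it: parse_srt is a made-up helper name, and it assumes that subtitle blocks are separated by at least one blank line and that the text itself never contains a blank line.

```r
# Hypothetical helper (not from any package): turn "modified" srt text
# into a data frame with CHARACTER and TEXT columns.
parse_srt <- function(x) {
  # Assumption: blocks are separated by one or more blank lines
  blocks <- trimws(strsplit(x, "\n{2,}")[[1]])
  blocks <- blocks[nzchar(blocks)]

  # Split every block into its individual lines
  lines <- strsplit(blocks, "\n", fixed = TRUE)

  # Line 1 is "<number> <name>"; drop the number to keep the name
  header <- vapply(lines, function(l) l[1], character(1))
  who <- trimws(sub("^[0-9]+", "", header))
  who[!nzchar(who)] <- NA_character_   # subtitle without a character name

  # Line 2 is the timestamp; everything after it is the subtitle text
  text <- vapply(lines, function(l) paste(l[-(1:2)], collapse = "\n"),
                 character(1))

  data.frame(CHARACTER = who, TEXT = text, stringsAsFactors = FALSE)
}

# Usage with the txt string defined at the top of this answer:
# parse_srt(txt)
```

Line breaks inside a subtitle are kept as \n in the TEXT column, and a block with no name gets NA in CHARACTER.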

CodePudding user response:

You are right that you can use regular expressions to accomplish this, and the stringr package is usually a good choice for that. How well it works depends on how consistent your texts are, but this works for your example. It might not work if there are exceptions to the rule, but you can tweak the patterns. regex101 is a great help for testing them.

After your feedback I think splitting the text into chunks first using strsplit makes it easier to process. Then using dplyr and stringr:

library(dplyr)

input_string <- "1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n
                 2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas:\n Hi I'm Lucas\n Lucas is my name\n\n
                 1237 VvdL\n00:00:02,701 --> 00:00:02,900\nI'm\nHappy\nThis\nSeems\nTo\nWork"

tmp <- strsplit(input_string, split = '\\n\\n', perl = TRUE) %>%
  data.frame 

colnames(tmp) <- "full_line"

tmp %>%
  mutate(CHARACTER = stringr::str_extract_all(full_line, "(?<=\\d )[a-zA-Z]+?(?=\\n)"), 
         TEXT = stringr::str_extract_all(full_line, "(?<=\\d\\n)(.|\\s)*")) %>%
  select(CHARACTER, TEXT)


 CHARACTER                                                           TEXT
1      Matt                                          This is said by Matt.
2     Lucas While this is said by Lucas:\n Hi I'm Lucas\n Lucas is my name
3      VvdL                              I'm\nHappy\nThis\nSeems\nTo\nWork
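
To see what the lookbehind and lookahead are doing, it may help to try patterns along these lines on a single chunk. The sketch below uses base R with perl = TRUE instead of stringr; the names chunk, name_pat and text_pat are made up for the illustration. One caveat, as far as I can tell: if a subtitle has no character name, its header line ends in "digit + newline" too, so the text pattern would start matching at the timestamp line.

```r
# One subtitle block, as produced by splitting on blank lines
chunk <- "2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas:\n Hi I'm Lucas"

# Name: letters that come right after "digit + space" (lookbehind)
# and right before a newline (lookahead)
name_pat <- "(?<=\\d )[a-zA-Z]+(?=\\n)"

# Text: everything after the first "digit + newline",
# which here is the end of the timestamp line
text_pat <- "(?<=\\d\\n)[\\s\\S]*"

# regexpr() finds the first match, regmatches() extracts it
name <- regmatches(chunk, regexpr(name_pat, chunk, perl = TRUE))
text <- regmatches(chunk, regexpr(text_pat, chunk, perl = TRUE))

print(name)
cat(text, "\n")
```

Neither the lookbehind nor the lookahead consumes characters, which is why the number, the space and the newline do not end up in the result.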