I have some txt files that were originally srt files, the format in which subtitles are published.
The pattern they usually follow is like the following:
Subtitle_number
Beginning_min --> Ending_min
Text
As an example, this might be the structure of an srt file:
1
00:00:00,100 --> 00:00:01,500
This is the first subtitle

2
00:00:01,700 --> 00:00:02,300
of the movie
Now, I have some "modified" srt files, which differ from normal ones in that the name of the character appears right after the subtitle number. Here is an example:
1 Matt
00:00:00,100 --> 00:00:01,500
This is said by Matt

2 Lucas
00:00:01,700 --> 00:00:02,300
While this is said by Lucas
What I would like to do is to parse these files in order to create a data.frame like the following:
-----------------------------------------------
| CHARACTER    | TEXT                         |
|--------------|------------------------------|
| Matt         | This is said by Matt         |
|--------------|------------------------------|
| Lucas        | While this is said by Lucas  |
-----------------------------------------------
So, I do not want the subtitle number or the timestamps.
I have been able to read the text with the readtext library, resulting in something like this:
1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas
Note that there might be \n inside the texts as well, along with any other (readable) character.
Here is where I am stuck. I guess I would have to use some kind of regex to extract all the names and then all the texts, but I have no clue how to do this.
Any help is highly appreciated!
CodePudding user response:
Here is a step-by-step way to do this without regex. It's a bit sloppy, but it's meant to show the logic of how to approach a file like this. The end result is a data frame from which you can grab the info you want.
txt <- "1 Matt
00:00:00,100 --> 00:00:01,500
This is said by Matt

2 Lucas
00:00:01,700 --> 00:00:02,300
While this is said by Lucas\nand another line

3
00:00:01,700 --> 00:00:02,300
While this is said by nobody"
library(readr)
library(tidyr)
library(tibble)
library(dplyr)
library(purrr)
df <- tibble(txt = read_lines(txt))
df %>%
  rowid_to_column("row") %>%
  # a blank line starts a new subtitle block
  group_by(group = cumsum(txt == "")) %>%
  filter(!(txt == "")) %>%
  # field 1 = number and name, 2 = timestamps, 3 = all remaining text lines
  mutate(field = pmin(row_number(), 3)) %>%
  group_by(group, field) %>%
  summarize(txt = paste(txt, collapse = "\n"), .groups = "drop") %>%
  pivot_wider(names_from = "field",
              values_from = "txt") %>%
  select(-group) %>%
  set_names(c("Col1", "Col2", "Col3")) %>%
  separate(Col1, c("Col1A", "Col1B"), extra = "merge", fill = "right")
And you get this data frame. You can name things whatever you want, of course.
# A tibble: 3 x 4
  Col1A Col1B Col2                          Col3
  <chr> <chr> <chr>                         <chr>
1 1     Matt  00:00:00,100 --> 00:00:01,500 "This is said by Matt"
2 2     Lucas 00:00:01,700 --> 00:00:02,300 "While this is said by Lucas\nand another line"
3 3     NA    00:00:01,700 --> 00:00:02,300 "While this is said by nobody"
EDIT
Here is a more streamlined way using a bit of tidyverse.
library(tidyr)
library(dplyr)
tibble(txt = txt) %>%
  separate_rows(txt, sep = "\\n\\n") %>%
  separate(txt, c("A", "B", "C"), sep = "\n", extra = "merge") %>%
  separate(A, c("A1", "B2"), extra = "merge", fill = "right")
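If you only need the CHARACTER and TEXT columns from the question, the same split-by-blank-line logic can also be sketched in base R. This is a minimal sketch, not part of the code above: parse_named_srt and example_txt are hypothetical names, and it assumes that blocks are separated by blank lines, that the first line of a block is the number plus an optional name, and that the second line is the timestamp. A subtitle whose text itself contains a blank line would be split into two blocks.

```r
# Hypothetical helper: parse "modified" srt text into a data frame with
# CHARACTER and TEXT columns, using base R only.
parse_named_srt <- function(x) {
  # Normalise Windows line endings, then split into blocks on blank lines
  x <- gsub("\r\n", "\n", x, fixed = TRUE)
  blocks <- strsplit(trimws(x), "\n{2,}")[[1]]

  # Split each block into its individual lines
  lines <- strsplit(blocks, "\n", fixed = TRUE)

  # Line 1 is "<number> <name>": drop the leading number to keep the name
  first <- vapply(lines, `[`, character(1), 1)
  name <- trimws(sub("^\\s*\\d+", "", first))
  name[name == ""] <- NA_character_

  # Line 2 is the timestamp; everything from line 3 onward is the text
  text <- vapply(lines, function(l) paste(l[-(1:2)], collapse = "\n"),
                 character(1))

  data.frame(CHARACTER = name, TEXT = text, stringsAsFactors = FALSE)
}

example_txt <- paste(
  "1 Matt", "00:00:00,100 --> 00:00:01,500", "This is said by Matt", "",
  "2 Lucas", "00:00:01,700 --> 00:00:02,300",
  "While this is said by Lucas", "and another line", "",
  "3", "00:00:01,700 --> 00:00:02,300", "While this is said by nobody",
  sep = "\n"
)
result <- parse_named_srt(example_txt)
```

Blocks without a name end up with NA in CHARACTER, which mirrors the fill = "right" behaviour of separate() above.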
CodePudding user response:
You are right that you can use regular expressions to accomplish this, and the stringr package is usually a good choice for that. How well it works depends on how consistent your texts are: this works for your example, but it might fail if there are exceptions to the rule, in which case you can tweak the patterns. regex101 is a great help for that.
After your feedback, I think splitting the text into chunks first using strsplit makes it easier to process. Then, using dplyr and stringr:
library(dplyr)
input_string <- "1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n
2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas:\n Hi I'm Lucas\n Lucas is my name\n\n
1237 VvdL\n00:00:02,701 --> 00:00:02,900\nI'm\nHappy\nThis\nSeems\nTo\nWork"
# Split into one chunk per subtitle (chunks are separated by a blank line)
tmp <- strsplit(input_string, split = '\\n\\n', perl = TRUE) %>%
  data.frame
colnames(tmp) <- "full_line"
tmp %>%
  # CHARACTER: the letters between "<number> " and the end of the first line
  # TEXT: everything after the timestamp line (a digit followed by a newline)
  mutate(CHARACTER = stringr::str_extract_all(full_line, "(?<=\\d )[a-zA-Z]+?(?=\\n)"),
         TEXT = stringr::str_extract_all(full_line, "(?<=\\d\\n)(.|\\s)*")) %>%
  select(CHARACTER, TEXT)
CHARACTER TEXT
1 Matt This is said by Matt.
2 Lucas While this is said by Lucas:\n Hi I'm Lucas\n Lucas is my name
3 VvdL I'm\nHappy\nThis\nSeems\nTo\nWork
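If you prefer a single pattern per chunk, capture groups are an alternative to the two lookbehind patterns. The following is only a sketch under a few assumptions: extract_character_text is a hypothetical helper name, each chunk is assumed to consist of a number with an optional name, then one timestamp line containing "-->", then the text, and chunks that do not fit that shape are silently dropped.

```r
# Hypothetical helper: one regular expression per chunk, with capture groups
extract_character_text <- function(x) {
  # Split on blank lines (tolerating extra whitespace between the newlines)
  chunks <- strsplit(trimws(x), "\n\\s*\n", perl = TRUE)[[1]]

  # Group 1: whatever follows the number on the first line (the name)
  # Group 2: everything after the timestamp line (the text, newlines included)
  pattern <- "(?s)^\\s*\\d+[ \\t]*([^\\n]*)\\n[^\\n]*-->[^\\n]*\\n(.*)$"
  m <- regmatches(chunks, regexec(pattern, chunks, perl = TRUE))

  # Keep only chunks that matched (full match plus two groups)
  m <- m[lengths(m) == 3]

  name <- trimws(vapply(m, `[`, character(1), 2))
  name[name == ""] <- NA_character_

  data.frame(CHARACTER = name,
             TEXT = vapply(m, `[`, character(1), 3),
             stringsAsFactors = FALSE)
}

example_input <- paste0(
  "1 Matt\n00:00:00,100 --> 00:00:01,500\nThis is said by Matt.\n\n",
  "2 Lucas\n00:00:01,700 --> 00:00:02,300\nWhile this is said by Lucas:\n",
  " Hi I'm Lucas\n Lucas is my name\n\n",
  "1237 VvdL\n00:00:02,701 --> 00:00:02,900\nI'm\nHappy\nThis\nSeems\nTo\nWork"
)
parsed <- extract_character_text(example_input)
```

Because group 1 takes the rest of the first line rather than only letters, names containing spaces or other characters are kept whole, and a chunk without a name gets NA.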