Remove transcript timestamps and join the lines to make paragraph

Time:12-28

  • File: Plain Text Document
  • Content: Youtube timestamped transcript


I can separately remove each line's timestamp:

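# content is assumed to be the list of transcript lines (e.g. from f.readlines())
for count, line in enumerate(content, start=1):
    if count % 2 == 0:  # even-numbered lines hold the text, odd ones the timestamps
        s = line.replace('\n', '')
        print(s)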

I can also join the sentences if I don't remove the timestamps:

with open('file.txt') as f:
    print(" ".join(line.strip() for line in f))

But when I tried to do both together (removing the timestamps and joining the lines), none of my attempts produced the right output:

with open('Russell Brand Script.txt') as m:
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence=line.replace('\n',' ')
            print(" ".join(sentence.rstrip('\n'))) 

I also tried various forms of print(" ".join(sentence.rstrip('\n'))) and print(" ".join(sentence.strip())), but the result is always wrong: either every character is separated by a space, or each sentence is still printed on its own line (screenshots not reproduced here).

How can I remove the timestamps and join the sentences to create a paragraph at once?

CodePudding user response:

When you pass a single string to .join(), it treats that string as a sequence of characters and inserts the separator between every character. Also note that print(), by default, adds a newline after each string it prints.
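A minimal sketch of that behaviour, using made-up sample strings rather than the asker's file:

```python
# Passing a plain string: join() iterates over its characters.
spaced = " ".join("merry")
print(spaced)

# Passing a list of strings: join() puts the separator between whole items.
joined = " ".join(["merry christmas", "ho ho ho"])
print(joined)
```

The first call yields "m e r r y", the second "merry christmas ho ho ho", which is why the sentences have to be collected into a list before joining.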

To get around this, save each cleaned sentence to a list and print the whole paragraph once at the end using " ".join(). This avoids the newline that print() adds after every sentence, and lets you do further processing on the paragraph afterwards, if desired.

with open('put_your_filename_here.txt') as m:
    sentences = []
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence=line.replace('\n', '')
            sentences.append(sentence)
    print(' '.join(sentences))

(Made a small edit to the code -- the old version of the code produced a trailing space after the paragraph.)
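As a variation on the same every-second-line idea, list slicing can replace the counter. This is only a sketch: join_even_lines is a hypothetical helper name, the sample list stands in for the file's lines, and it assumes the transcript strictly alternates timestamp and text with no blank or missing lines.

```python
def join_even_lines(lines):
    # lines[1::2] selects the 2nd, 4th, 6th... entries, i.e. the text lines
    return " ".join(line.strip() for line in lines[1::2])

# Stand-in for f.readlines() on the transcript file
sample = ["00:00\n", "merry christmas it's our christmas video\n",
          "00:03\n", "to you i already regret this hat but if\n"]
result = join_even_lines(sample)
print(result)
```

If the file ever deviates from the strict alternation, filtering on the timestamp pattern itself is more robust than relying on line positions.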

CodePudding user response:

TL;DR: copy-paste solution using list-comprehension with if as filter and regex to match timestamp: ' '.join([line.strip() for line in transcript if not re.match(r'\d{2}:\d{2}', line)]).
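For reference, a self-contained form of that one-liner might look like the following. The transcript list is sample data standing in for the lines of the file; note that re.match only checks the start of each line, so a text line that itself began with something like 12:30 would also be dropped.

```python
import re

# Sample data standing in for the lines read from the transcript file
transcript = [
    "00:00\n",
    "merry christmas it's our christmas video\n",
    "00:03\n",
    "to you i already regret this hat but if\n",
]

# Keep only lines that do not start with an mm:ss timestamp, then join them
paragraph = " ".join(
    line.strip() for line in transcript if not re.match(r"\d{2}:\d{2}", line)
)
print(paragraph)
```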

Explained

Suppose your text input given is:

00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho

Then you can skip the timestamps by matching them with the regex \d{2}:\d{2} and append every remaining line to a list as a phrase. Trim each phrase with strip(), which removes leading and trailing whitespace (including the newline). Finally, join all phrases into a paragraph using a single space as the delimiter:

import re

def to_paragraph(transcript_lines):
    phrases = []
    for line in transcript_lines:
        trimmed = line.strip()
        # keep non-empty lines that do not start with a timestamp such as 00:03
        if trimmed != '' and not re.match(r'\d{2}:\d{2}', trimmed):
            phrases.append(trimmed)
        else:  # TODO: for debug only, remove
            print(line)  # TODO: for debug only, remove
    return " ".join(phrases)

t = '''
00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho
'''

paragraph = to_paragraph(t.splitlines())
print(paragraph)

with open('put_your_filename_here.txt') as f:
    print(to_paragraph(f.readlines()))

Outputs:


00:00
00:03
00:05
00:07
00:09
merry christmas it's our christmas video to you i already regret this hat but if we got some fantastic content for you a look at the most joyous and wonderful aspects have a very merry year ho ho ho

The result is the same as what youtubetranscript.com returns for the given YouTube video.
