- File: Plain Text Document
- Content: YouTube timestamped transcript
I can separately remove each line's timestamp:
for count, line in enumerate(content, start=1):
    if count % 2 == 0:
        s = line.replace('\n','')
        print(s)
I can also join the sentences if I don't remove the timestamps:
with open('file.txt') as f:
    print(" ".join(line.strip() for line in f))
But when I tried to do both together (removing the timestamps and joining the lines), none of my attempts produced the right outcome:
with open('Russell Brand Script.txt') as m:
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence=line.replace('\n',' ')
            print(" ".join(sentence.rstrip('\n')))
I also tried variations such as print(" ".join(sentence.rstrip('\n'))) and print(" ".join(sentence.strip())), but none of them gave the right result.
How can I remove the timestamps and join the sentences to create a paragraph at once?
CodePudding user response:
When you pass a single string to .join(), it treats that string as a sequence of characters and inserts the separator between every character. Also note that print(), by default, adds a newline after whatever it prints.
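The difference is easy to see in isolation. The following is only a minimal sketch with made-up sample strings (the variable names are illustrative and not from the question):

```python
# Illustrative sample values; not taken from the asker's file.
word = "hat"

# Passing a str: join iterates over its characters.
per_char = " ".join(word)

# Passing a list of str: join puts the separator between the items.
per_item = " ".join(["regret", "this", word])

print(per_char)
print(per_item)
```

The first call spaces out the letters of the single string, while the second joins whole items, which is the behavior wanted for a paragraph.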
To get around this, you can save each modified sentence to a list and then output the entire paragraph at once at the end using " ".join(). This avoids the newline issue described above and lets you do additional processing on the paragraph afterwards, if desired.
with open('put_your_filename_here.txt') as m:
    sentences = []
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence=line.replace('\n', '')
            sentences.append(sentence)
    print(' '.join(sentences))
(Made a small edit to the code -- the old version of the code produced a trailing space after the paragraph.)
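If the file strictly alternates between a timestamp line and a text line, the same idea can probably be written more compactly with slicing. This is a sketch under that assumption; join_even_lines is a hypothetical helper name, and it expects a list (for example the result of f.readlines()), not an open file:

```python
def join_even_lines(lines):
    """Join every second line (2nd, 4th, ...) into one paragraph."""
    # lines[1::2] starts at index 1 (the second line) and takes every other line.
    return " ".join(line.strip() for line in lines[1::2])


# Assumed sample input: timestamp and text lines strictly alternate.
sample = [
    "00:00\n",
    "merry christmas it's our christmas video\n",
    "00:03\n",
    "to you i already regret this hat but if\n",
]
paragraph = join_even_lines(sample)
print(paragraph)
```

This breaks as soon as the alternation is off by one line (a blank first line, for instance), so it is only suitable when the layout of the file is known.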
CodePudding user response:
TL;DR: a copy-paste solution using a list comprehension with if as a filter and a regex to match the timestamp lines (it needs import re, and transcript is any iterable of lines, e.g. an open file):
' '.join([line.strip() for line in transcript if not re.match(r'\d{2}:\d{2}', line)])
Explained
Suppose the given input text is:
00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho
Then you can skip the timestamps with the regex \d{2}:\d{2} and append all remaining lines, as phrases, to a list. Trim each phrase using strip(), which removes leading and trailing whitespace. When you finally join all the phrases into a paragraph, use a space as the delimiter:
import re

def to_paragraph(transcript_lines):
    phrases = []
    for line in transcript_lines:
        trimmed = line.strip()
        # keep non-empty lines that do not start with a timestamp
        if trimmed != '' and not re.match(r'\d{2}:\d{2}', trimmed):
            phrases.append(trimmed)
        else:  # TODO: for debug only, remove
            print(line)  # TODO: for debug only, remove
    return " ".join(phrases)
t = '''
00:00
merry christmas it's our christmas video
00:03
to you i already regret this hat but if
00:05
we got some fantastic content for you a
00:07
look at the most joyous and wonderful
00:09
aspects have a very merry year ho ho ho
'''
paragraph = to_paragraph(t.splitlines())
print(paragraph)
with open('put_your_filename_here.txt') as f:
    print(to_paragraph(f.readlines()))
Outputs:
00:00
00:03
00:05
00:07
00:09
merry christmas it's our christmas video to you i already regret this hat but if we got some fantastic content for you a look at the most joyous and wonderful aspects have a very merry year ho ho ho
The result is the same as what youtubetranscript.com returns for the given YouTube video.
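One caveat with the pattern \d{2}:\d{2}: re.match only anchors at the start of the line, so a phrase that happens to begin with something like 12:30 would be dropped as well, and a timestamp with a single-digit hours part such as 1:02:03 would not be recognized. A possible stricter variant is sketched below, assuming timestamps look like m:ss, mm:ss, h:mm:ss or hh:mm:ss and occupy the whole line. TIMESTAMP and to_paragraph_strict are names invented for this sketch:

```python
import re

# Assumed timestamp shapes: "m:ss", "mm:ss", "h:mm:ss", "hh:mm:ss".
TIMESTAMP = re.compile(r'(?:\d{1,2}:)?\d{1,2}:\d{2}')


def to_paragraph_strict(transcript_lines):
    """Join all non-empty lines that are not a bare timestamp."""
    phrases = []
    for line in transcript_lines:
        trimmed = line.strip()
        # fullmatch requires the whole line to be a timestamp,
        # so a phrase that merely starts with "12:30" is kept.
        if trimmed and not TIMESTAMP.fullmatch(trimmed):
            phrases.append(trimmed)
    return " ".join(phrases)


# Made-up sample lines covering the edge cases described above.
lines = ["00:00", "merry christmas", "1:02:03", "12:30 is when we start", ""]
result = to_paragraph_strict(lines)
print(result)
```

Whether the extra strictness is needed depends on how the transcript was exported, so it is worth checking a few lines of the actual file first.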