I have a text file containing the transcription of a podcast. This is what the content I have to work with looks like: a host and a guest take turns, and each turn is preceded by a line with the speaker's name and a timecode:
Bill Jacobs 7:22
I love that. I love thinking about it that way, that just keep an eye out for things that people say and be like, is that supported by anything? Because a lot of times it's not. A lot of times it's not. And we take it for granted.
Andy Ward 7:37
Yeah, the one we're most famous for is like blogging takes time. Oh, really? How much time does it take to write a blog post? That's when we have to reach out to 1000 bloggers to get them to fill out a survey. There's literally no other way to have to answer the question. You have to ask people how long they spend writing and then average a bunch of answers together. So the answer is four hours. And we know that because we've asked for seven years in a row. We've asked 1000 bloggers how long they spend writing blog posts, and that's the average.
Bill Jacobs 8:06
That's the average. Okay, so that's interesting. Let's talk about that. Let's talk about that first in the context of what's a good bounce rate, right? In terms of the approach that you have to research, right? How did you, you mentioned you went into the the analytics dashboard, and you kind of extracted that information from there. Is there is there any anything else on that front that that you think we should touch on with regards to the extraction of the data?
The change I would like to make is this: every time a line starts with the guest or host name, I need to reformat it so it ends up looking like this:
**[7:22] Bill Jacobs** I love that. I love thinking about it that way, that just keep an eye out for things that people say and be like, is that supported by anything? Because a lot of times it's not. A lot of times it's not. And we take it for granted.
The name and timecode need to be swapped, with the timecode in brackets and both in bold (** **). The text from the next line should then follow on the same line, separated by a tab or space.
I already wrote a for loop that successfully picks up each line where either the host or guest name shows up in the txt file:
# host_name and guest_name are strings defined earlier in the script
file = open('transcription.txt')
string_list = file.readlines()
host_index = len(host_name)
guest_index = len(guest_name)
list_of_lists = []
for i in range(len(string_list)):
    # pick up the lines that start with the host's name
    if string_list[i][0:host_index] == host_name:
        stripped_line = string_list[i].strip()
        line_list = stripped_line.split()
        list_of_lists.append(line_list)
I am unsure what the best way is to make these edits to the corresponding lines and write them back into the text file. Any advice would be greatly appreciated.
CodePudding user response:
>>> cat file.txt
Bill Jacobs 7:22
I love that. I love thinking about it that way, that just keep an eye out for things that people say and be like, is that supported by anything? Because a lot of times it's not. A lot of times it's not. And we take it for granted.
Andy Ward 7:37
Yeah, the one we're most famous for is like blogging takes time. Oh, really? How much time does it take to write a blog post? That's when we have to reach out to 1000 bloggers to get them to fill out a survey. There's literally no other way to have to answer the question. You have to ask people how long they spend writing and then average a bunch of answers together. So the answer is four hours. And we know that because we've asked for seven years in a row. We've asked 1000 bloggers how long they spend writing blog posts, and that's the average.
Bill Jacobs 8:06
That's the average. Okay, so that's interesting. Let's talk about that. Let's talk about that first in the context of what's a good bounce rate, right? In terms of the approach that you have to research, right? How did you, you mentioned you went into the the analytics dashboard, and you kind of extracted that information from there. Is there is there any anything else on that front that that you think we should touch on with regards to the extraction of the data?
>>> lst = []
>>> file = open("file.txt", 'r').read().splitlines()
>>> for i in range(0, len(file), 3):
...     # a step of 3 assumes each entry is a header line, a text line, and a blank separator
...     x = file[i:i + 3]
...     *name, time = x[0].split()
...     lst.append(f"**[{time}] {' '.join(name)}** {x[1]}")
...
>>> open("file_modified.txt", 'w').write("\n\n".join(lst))
>>> cat file_modified.txt
**[7:22] Bill Jacobs** I love that. I love thinking about it that way, that just keep an eye out for things that people say and be like, is that supported by anything? Because a lot of times it's not. A lot of times it's not. And we take it for granted.
**[7:37] Andy Ward** Yeah, the one we're most famous for is like blogging takes time. Oh, really? How much time does it take to write a blog post? That's when we have to reach out to 1000 bloggers to get them to fill out a survey. There's literally no other way to have to answer the question. You have to ask people how long they spend writing and then average a bunch of answers together. So the answer is four hours. And we know that because we've asked for seven years in a row. We've asked 1000 bloggers how long they spend writing blog posts, and that's the average.
**[8:06] Bill Jacobs** That's the average. Okay, so that's interesting. Let's talk about that. Let's talk about that first in the context of what's a good bounce rate, right? In terms of the approach that you have to research, right? How did you, you mentioned you went into the the analytics dashboard, and you kind of extracted that information from there. Is there is there any anything else on that front that that you think we should touch on with regards to the extraction of the data?
CodePudding user response:
If we can rely on the file spacing, then we can do:
with open("transcription.txt") as file:
    string_list_split = file.read().splitlines()
with open("modified_file.txt", 'w') as file:
    # every third line, starting from the first, is a "name timecode" header
    for i, timestamp in enumerate(string_list_split[0::3]):
        *name, time = timestamp.split()
        file.write(f'**[{time}] {" ".join(name)}** {string_list_split[i*3 + 1]}\n\n')
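If the spacing cannot be relied on, one alternative is to detect the header lines by their shape instead of their position. The following is only a sketch under some assumptions: `format_transcript`, `HEADER` and the `speakers` argument are names made up for this example, a header is assumed to be a known speaker name followed by a timecode such as `7:22` or `1:07:22` alone on a line, and the speech is assumed to be the next non-blank line.

```python
import re

# Assumed header shape: "<name> <m:ss>" or "<name> <h:mm:ss>" alone on a line.
HEADER = re.compile(r"^(?P<name>.+?)\s+(?P<time>\d{1,2}(?::\d{2}){1,2})$")


def format_transcript(text, speakers):
    """Merge each speaker header with the speech line that follows it."""
    out = []
    lines = iter(text.splitlines())
    for line in lines:
        line = line.strip()
        match = HEADER.match(line)
        if match and match["name"] in speakers:
            # Pull the next non-blank line from the same iterator as the speech.
            body = next((l.strip() for l in lines if l.strip()), "")
            out.append(f"**[{match['time']}] {match['name']}** {body}")
        elif line:
            # Keep any other non-blank line unchanged.
            out.append(line)
    # Entries are separated by a blank line, as in the answers above.
    return "\n\n".join(out) + "\n"


sample = "Bill Jacobs 7:22\nI love that.\n\nAndy Ward 7:37\nYeah.\n"
result = format_transcript(sample, {"Bill Jacobs", "Andy Ward"})
```

To apply it to the real file, read the whole text with `open('transcription.txt').read()`, pass it through the function, and write the returned string to a new file. Writing to a separate file rather than overwriting the original keeps the source intact if a header is missed.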