I am trying to analyze an earnings call using Python regular expressions. I want to delete the unnecessary lines that contain only the name and position of the person who speaks next.
This is an excerpt of the text I want to analyze:
"Questions and Answers\nOperator [1]\n\n Shannon Siemsen Cross, Cross Research LLC - Co-Founder, Principal & Analyst [2]\n I hope everyone is well. Tim, you talked about seeing some improvement in the second half of April. So I was wondering if you could just talk maybe a bit more on the segment and geographic basis what you're seeing in the various regions that you're selling in and what you're hearing from your customers. And then I have a follow-up.\n Timothy D. Cook, Apple Inc. - CEO & Director [3]\n ..."
Each line that I want to delete ends with a number in square brackets, e.g. [1].
So I used the following line of code to get these lines:
name_lines = re.findall('.*[\d]]', text)
This works and gives me the following list: ['Operator [1]', ' Shannon Siemsen Cross, Cross Research LLC - Co-Founder, Principal & Analyst [2]', ' Timothy D. Cook, Apple Inc. - CEO & Director [3]']
In the next step I want to remove these strings from the text using the following code:
for i in range(0,len(name_lines)):
    text = re.sub(name_lines[i], '', text)
But this does not work. Even replacing a single string without the loop does not work, and I have no clue why.
Also, if I use re.findall to search for the lines I obtained from the first line of code, I don't get a match.
CodePudding user response:
Try using re.sub directly on the text to replace the matches:
import re
text = """\
Questions and Answers
Operator [1]
Shannon Siemsen Cross, Cross Research LLC - Co-Founder, Principal & Analyst [2]
I hope everyone is well. Tim, you talked about seeing some improvement in the second half of April. So I was wondering if you could just talk maybe a bit more on the segment and geographic basis what you're seeing in the various regions that you're selling in and what you're hearing from your customers. And then I have a follow-up.
Timothy D. Cook, Apple Inc. - CEO & Director [3]"""
text = re.sub(r".*\d]", "", text)
print(text)
Prints (empty lines omitted):
Questions and Answers
I hope everyone is well. Tim, you talked about seeing some improvement in the second half of April. So I was wondering if you could just talk maybe a bit more on the segment and geographic basis what you're seeing in the various regions that you're selling in and what you're hearing from your customers. And then I have a follow-up.
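Note that a pattern like the one above removes only the characters on the matching line, not the line break after it, so an empty line stays behind for each removed speaker line. A minimal sketch of one way to drop the whole line, assuming every speaker line ends with a bracketed number such as [2]; the sample text is shortened and the name speaker_line is only illustrative:

```python
import re

text = (
    "Questions and Answers\n"
    "Operator [1]\n"
    "\n"
    " Shannon Siemsen Cross, Cross Research LLC - Co-Founder, Principal & Analyst [2]\n"
    " I hope everyone is well.\n"
    " Timothy D. Cook, Apple Inc. - CEO & Director [3]\n"
    " ...\n"
)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# \[\d+\] matches a literal "[", one or more digits, and a literal "]".
# The trailing \n? also consumes the line break, so no empty line remains.
speaker_line = re.compile(r"^.*\[\d+\][ \t]*$\n?", flags=re.MULTILINE)

cleaned = speaker_line.sub("", text)
print(cleaned)
```

Anchoring the bracketed number at the end of the line should also avoid deleting an ordinary sentence that merely contains something like "3]" in the middle.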
CodePudding user response:
The first argument to re.sub is treated as a regular expression, so the square brackets get a special meaning and don't match literally: [1] is a character class that matches the single character 1, not the text [1].
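A small sketch of the effect, using a shortened sample text (the variable names are only illustrative). It also shows that re.escape makes the string safe to use as a pattern, if a regex-based replacement is preferred:

```python
import re

text = "Questions and Answers\nOperator [1]\n I hope everyone is well."
name_line = "Operator [1]"

# As a pattern, "[1]" is a character class that matches the single
# character "1", so this searches for "Operator 1", which is not in the text.
unescaped = re.sub(name_line, "", text)

# re.escape backslash-escapes the special characters, so the brackets
# are matched literally.
escaped = re.sub(re.escape(name_line), "", text)

print(unescaped == text)  # the unescaped pattern changed nothing
print(repr(escaped))
```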
You don't need a regular expression for this replacement at all though (and you also don't need the loop counter i):
for name_line in name_lines:
    text = text.replace(name_line, '')