I have strings that look like this:
docs = ['Hi, my name is Eric. Are you blue?',
"Hi, I'm ! What is your name?",
'This is a great idea. I would love to go.',
'Hello, I am Jane Brown. What is your name?',
"Hello, I am a doctor! Let's go to the mall.",
'I am ready to go. Mom says hello.']
I want to remove the first sentence of a string if it starts with 'Hi' or 'Hello'.
Desired Output:
docs = ['Are you blue?',
'What is your name?',
'This is a great idea. I would love to go.',
'What is your name?',
"Let's go to the mall.",
'I am ready to go. Mom says hello.']
The regex I have is
re.match('.*?[a-z0-9][.?!](?= )', x)
But this only gives me the first sentence, in a weird format like:
<re.Match object; span=(0, 41), match='Hi, my name is Eric.'>
What can I do to get my desired output? Thanks
CodePudding user response:
You can use
docs = [re.sub(r'^H(?:ello|i)\b.*?[.?!]\s+', '', doc) for doc in docs]
See the regex demo. Details:
^ - start of string
H(?:ello|i)\b - the word Hello or Hi (\b is a word boundary)
.*? - any zero or more chars other than line break chars, as few as possible
[.?!] - a ., ? or !
\s+ - one or more whitespaces.
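As a small sketch, the pieces listed above can also be written as a commented pattern with re.VERBOSE, which makes each part easier to see. The variable name pattern is just an illustration, not part of the original answer:

```python
import re

# Same pattern as in the answer, spelled out piece by piece.
# In VERBOSE mode, whitespace and '#' comments inside the pattern are ignored.
pattern = re.compile(r'''
    ^              # start of string
    H(?:ello|i)\b  # the word Hello or Hi
    .*?            # as few characters as possible (no line breaks)
    [.?!]          # the first sentence terminator
    \s+            # one or more whitespace characters
''', re.VERBOSE)

result = pattern.sub('', 'Hello, I am Jane Brown. What is your name?')
print(result)
```

Because the pattern is anchored with ^, at most one replacement can happen per string.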
See the Python demo:
import re
docs = ['Hi, my name is Eric. Are you blue?',
"Hi, I'm ! What is your name?",
'This is a great idea. I would love to go.',
'Hello, I am Jane Brown. What is your name?',
"Hello, I am a doctor! Let's go to the mall.",
'I am ready to go. Mom says hello.']
docs = [re.sub(r'^H(?:ello|i)\b.*?[.?!]\s+', '', doc) for doc in docs]
print(docs)
Output:
[
'Are you blue?',
'What is your name?',
'This is a great idea. I would love to go.',
'What is your name?',
"Let's go to the mall.",
'I am ready to go. Mom says hello.'
]
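One case the list above does not cover is a string that consists of the greeting sentence only (for example 'Hi.'), where no whitespace follows the terminator. A possible variation, assuming you want such strings reduced to an empty string, is to accept either whitespace or the end of the string after the terminator. The helper name strip_greeting is made up for this sketch:

```python
import re

# Variation on the pattern above: after the terminator, accept either
# whitespace or the end of the string (assumption: a greeting-only string
# should become '').
GREETING = re.compile(r'^H(?:ello|i)\b.*?[.?!](?:\s+|$)')

def strip_greeting(doc):
    """Drop the first sentence if it starts with the word Hi or Hello."""
    return GREETING.sub('', doc, count=1)

docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?",
        'This is a great idea. I would love to go.',
        'Hello, I am Jane Brown. What is your name?',
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']

cleaned = [strip_greeting(doc) for doc in docs]
print(cleaned)
```

The \b keeps words such as 'Hither' from being treated as a greeting.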
CodePudding user response:
You would first have to split each string into sentences:
splitted_docs = []
for doc in docs:
    splitted_docs.append(doc.split('.'))
Then, you want to check each sentence for Hi or Hello with your regex and add the remaining sentences to the final array:
final_docs = []
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        if not re.match('.*?[a-z0-9][.?!](?= )', sentence):
            final_sentence.append(sentence)
    final_docs.append('.'.join(final_sentence))
Actually, your regex does not work here, so I changed the code to make it work. It goes as follows:
final_docs = []
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        if 'Hello' not in sentence and 'Hi' not in sentence:
            final_sentence.append(sentence)
    final_docs.append('.'.join(final_sentence))
Finally, filter your array to remove all the empty strings that may have been created in the process of joining:
final_docs = list(filter(lambda x: x != '', final_docs))
print(final_docs)
Output:
[' Are you blue?', 'This is a great idea. I would love to go.', ' What is your name?', 'I am ready to go. Mom says hello.']
I'll leave the full code here. Any suggestion is welcome; I am sure this can be solved with a more functional approach that may be easier to understand, but I am not familiar enough with that style.
import re
docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?",
        'This is a great idea. I would love to go.',
        'Hello, I am Jane Brown. What is your name?',
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']
splitted_docs = []
for doc in docs:
    splitted_docs.append(doc.split('.'))
final_docs = []
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        if 'Hello' not in sentence and 'Hi' not in sentence:
            final_sentence.append(sentence)
    final_docs.append('.'.join(final_sentence))
final_docs = list(filter(lambda x: x != '', final_docs))
print(final_docs)
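Since suggestions were invited: the output above drops two of the six entries and leaves leading spaces, because splitting on '.' misses sentences that end in '!' or '?'. One possible alternative, as a rough sketch only, is to split off just the first sentence at any of the three terminators and keep every entry. The function name drop_greeting_sentence is invented for this example:

```python
import re

docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?",
        'This is a great idea. I would love to go.',
        'Hello, I am Jane Brown. What is your name?',
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']

def drop_greeting_sentence(doc):
    # Split off only the first sentence: break at the first run of
    # whitespace that follows '.', '?' or '!'.
    parts = re.split(r'(?<=[.?!])\s+', doc, maxsplit=1)
    first = parts[0]
    rest = parts[1] if len(parts) > 1 else ''
    # Only the first sentence is checked, and only for a whole-word
    # greeting at its start, so 'This' or a later 'hello' is left alone.
    if re.match(r'(?:Hi|Hello)\b', first):
        return rest
    return doc

final_docs = [drop_greeting_sentence(doc) for doc in docs]
print(final_docs)
```

Using maxsplit=1 means the rest of each string is returned untouched, so no re-joining or filtering of empty strings is needed.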
CodePudding user response:
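import re
# Anchor at the start, match the greeting word, then lazily consume
# up to the first sentence terminator and any whitespace after it.
pattern = r'^(?:Hi|Hello)\b.*?[.?!]\s*'
for string in docs:
    mod_string = re.sub(pattern, '', string)
    print(mod_string)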