Python remove sentence if it is at start of string and starts with specific words?

I have strings that look like this:

docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?", 
        'This is a great idea. I would love to go.', 
        'Hello, I am Jane Brown. What is your name?', 
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']

I want to remove the first sentence of a string if it starts with 'Hi' or 'Hello'.

Desired Output:

docs = ['Are you blue?',
        'What is your name?', 
        'This is a great idea. I would love to go.', 
        'What is your name?', 
        "Let's go to the mall.",
        'I am ready to go. Mom says hello.']

The regex I have is

re.match('.*?[a-z0-9][.?!](?= )', x)

But this only gives me the first sentence, and in a weird format like:

<re.Match object; span=(0, 41), match='Hi, my name is Eric.'>

What can I do to get my desired output? Thanks
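
On the "weird format": `re.match` returns a match object rather than a string, and the matched text is available through `.group()`. A minimal sketch using the pattern from the question on the first example string (the names `m`, `first_sentence` and `rest` are only illustrative):

```python
import re

x = 'Hi, my name is Eric. Are you blue?'

# re.match returns a Match object, or None if nothing matches at the start
m = re.match(r'.*?[a-z0-9][.?!](?= )', x)

if m:
    first_sentence = m.group(0)   # the matched text as a plain string
    rest = x[m.end():].lstrip()   # everything after the match, leading space removed
    print(first_sentence)
    print(rest)
```

This only extracts the first sentence. It does not yet check whether that sentence starts with 'Hi' or 'Hello'.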

CodePudding user response：

You can use

docs = [re.sub(r'^H(?:ello|i)\b.*?[.?!]\s+', '', doc) for doc in docs]

Details:

^ - start of string
H(?:ello|i)\b - Hello or Hi word (\b is a word boundary)
.*? - any zero or more chars other than line break chars, as few as possible
[.?!] - a ., ? or !
\s+ - one or more whitespace characters.

See the Python demo:

import re
docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?", 
        'This is a great idea. I would love to go.', 
        'Hello, I am Jane Brown. What is your name?', 
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']
docs = [re.sub(r'^H(?:ello|i)\b.*?[.?!]\s+', '', doc) for doc in docs]
print(docs)

Output:

[
    'Are you blue?',
    'What is your name?',
    'This is a great idea. I would love to go.',
    'What is your name?',
    "Let's go to the mall.",
    'I am ready to go. Mom says hello.'
]
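
If the list of greeting words may change, the same idea could be wrapped in a small helper. This is only a sketch: `strip_greeting` and its `greetings` parameter are hypothetical names, not part of the answer above, and the pattern is built the same way (start anchor, greeting word, lazy match up to the first `.`, `?` or `!` followed by whitespace).

```python
import re

def strip_greeting(doc, greetings=('Hi', 'Hello')):
    """Remove the first sentence of doc if it starts with one of greetings."""
    # re.escape guards against greetings that contain regex metacharacters
    alternatives = '|'.join(re.escape(g) for g in greetings)
    pattern = rf'^(?:{alternatives})\b.*?[.?!]\s+'
    # count=1 is redundant with ^ but makes the single removal explicit
    return re.sub(pattern, '', doc, count=1)

docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?",
        'This is a great idea. I would love to go.',
        'Hello, I am Jane Brown. What is your name?',
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']

cleaned = [strip_greeting(doc) for doc in docs]
print(cleaned)
```

Because the pattern requires whitespace after the closing punctuation, a string that consists of a single greeting sentence (for example 'Hi there.') is left unchanged.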

CodePudding user response：

You would first have to split each string into sentences:

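splitted_docs = []
for doc in docs:  # avoid naming the variable str, which shadows the built-in
    # note: this splits on periods only, not on ! or ?
    splitted_docs.append(doc.split('.'))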

Then check each sentence for Hi or Hello with your regex, and add the sentences that don't match to the final list:

final_docs = []
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        if not re.match('.*?[a-z0-9][.?!](?= )', sentence):
            final_sentence.append(sentence)
    # join is a str method: call it on the separator, not on the list
    final_docs.append('.'.join(final_sentence))

Actually, your regex does not work here, so I changed the code to use a plain substring check instead. It goes as follows:

final_docs = []
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        # keep only sentences that contain neither greeting
        if 'Hello' not in sentence and 'Hi' not in sentence:
            final_sentence.append(sentence)
    final_docs.append('.'.join(final_sentence))

Finally, filter your array to remove all the empty strings that may have been created in the process of joining:

final_docs = list(filter(lambda x: x != '', final_docs))
print(final_docs)

Output:

[' Are you blue?', 'This is a great idea. I would love to go.', ' What is your name?', 'I am ready to go. Mom says hello.']

I'll leave the full code here. Any suggestion is welcome: I am sure this can be solved with a more functional approach that may be easier to understand, but I am not familiar enough with that style.

import re
docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?",
        'This is a great idea. I would love to go.',
        'Hello, I am Jane Brown. What is your name?',
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']

# split every string into sentences (on periods only)
splitted_docs = []
for doc in docs:
    splitted_docs.append(doc.split('.'))

# drop every sentence that contains a greeting, then rebuild the string
final_docs = []
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        if 'Hello' not in sentence and 'Hi' not in sentence:
            final_sentence.append(sentence)
    final_docs.append('.'.join(final_sentence))

# remove the empty strings left over after joining
final_docs = list(filter(lambda x: x != '', final_docs))
print(final_docs)
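
As the output shows, the code above drops the two strings whose greeting ends in `!`, and it leaves a leading space on the rest. One possible variation on the same split-then-check idea is sketched below. It assumes that a sentence ends at the first `.`, `?` or `!` followed by whitespace, and `drop_greeting_sentence` is a hypothetical helper name, not something from the question.

```python
import re

docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?",
        'This is a great idea. I would love to go.',
        'Hello, I am Jane Brown. What is your name?',
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']

def drop_greeting_sentence(doc):
    # Split once, at the whitespace that follows the first ., ? or !
    parts = re.split(r'(?<=[.?!])\s+', doc, maxsplit=1)
    first = parts[0]
    # Check only the first sentence, and only at its very start;
    # \b keeps words such as "History" from counting as "Hi"
    if len(parts) == 2 and re.match(r'(?:Hi|Hello)\b', first):
        return parts[1]
    return doc

final_docs = [drop_greeting_sentence(doc) for doc in docs]
print(final_docs)
```

Since only the first sentence is inspected, a later sentence that happens to contain 'Hi' or 'Hello' is kept, and no empty strings need to be filtered out afterwards.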

CodePudding user response：

import re

# Anchor at the start, match "Hi" or "Hello", then the rest of the first sentence
pattern = r'^(?:Hi|Hello)\b.*?[.?!]\s+'
for string in docs:
    mod_string = re.sub(pattern, '', string)
    print(mod_string)