Python remove sentence if it is at start of string and starts with specific words?

Time:06-10

I have strings that look like this:

docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?", 
        'This is a great idea. I would love to go.', 
        'Hello, I am Jane Brown. What is your name?', 
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']

I want to remove the first sentence of a string if it starts with 'Hi' or 'Hello'.

Desired Output:

docs = ['Are you blue?',
        'What is your name?', 
        'This is a great idea. I would love to go.', 
        'What is your name?', 
        "Let's go to the mall.",
        'I am ready to go. Mom says hello.']

The regex I have is

re.match('.*?[a-z0-9][.?!](?= )', x)

But this only gives me the first sentence, in a weird format like:

<re.Match object; span=(0, 41), match='Hi, my name is Eric.'>

What can I do to get my desired output? Thanks

CodePudding user response:

You can use

docs = [re.sub(r'^H(?:ello|i)\b.*?[.?!]\s+', '', doc) for doc in docs]

See the regex demo. Details:

  • ^ - start of string
  • H(?:ello|i)\b - Hello or Hi word (\b is a word boundary)
  • .*? - any zero or more chars other than line break chars as few as possible
  • [.?!] - a ., ? or !
  • \s+ - one or more whitespaces.

See the Python demo:

import re
docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?", 
        'This is a great idea. I would love to go.', 
        'Hello, I am Jane Brown. What is your name?', 
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']
docs = [re.sub(r'^H(?:ello|i)\b.*?[.?!]\s+', '', doc) for doc in docs]
print(docs)

Output:

[
    'Are you blue?',
    'What is your name?',
    'This is a great idea. I would love to go.',
    'What is your name?',
    "Let's go to the mall.",
    'I am ready to go. Mom says hello.'
]
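
As a hedged sketch, the same substitution can be wrapped in a small helper with a precompiled pattern. The names GREETING_SENTENCE and strip_greeting are hypothetical and not part of the answer; the pattern is the one described in the details list, with \s+ for the trailing whitespace.

```python
import re

# A leading "Hi" or "Hello" word, then the shortest run of characters up to
# the first ., ? or ! that is followed by at least one whitespace character.
GREETING_SENTENCE = re.compile(r'^H(?:ello|i)\b.*?[.?!]\s+')

def strip_greeting(doc):
    """Return doc without its first sentence if that sentence starts with Hi or Hello."""
    # The ^ anchor already limits this to one match; count=1 makes that explicit.
    return GREETING_SENTENCE.sub('', doc, count=1)

docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?",
        'This is a great idea. I would love to go.',
        'Hello, I am Jane Brown. What is your name?',
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']

cleaned = [strip_greeting(doc) for doc in docs]
print(cleaned)
```

Because the pattern requires whitespace after the sentence-ending punctuation, a string that consists of a single greeting sentence (for example 'Hello there.') is left unchanged. The word boundary \b also keeps words such as 'History' from being treated as a greeting.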

CodePudding user response:

You would first have to split each string into sentences (here, on periods only):

splitted_docs = []
for doc in docs:  # avoid naming the loop variable str, which shadows the built-in
    splitted_docs.append(doc.split('.'))

Then check each sentence for Hi or Hello with your regex, and add the sentences that do not match to the final list:

final_docs = []
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        if not re.match('.*?[a-z0-9][.?!](?= )', sentence):
            final_sentence.append(sentence)
    # join is a string method, so the separator comes first
    final_docs.append('.'.join(final_sentence))

Actually, your regex does not work for this check, so I replaced it with a plain substring test. The loop becomes:

for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        # keep only the pieces that contain neither greeting
        if 'Hello' not in sentence and 'Hi' not in sentence:
            final_sentence.append(sentence)
    final_docs.append('.'.join(final_sentence))

Finally, filter your array to remove all the empty strings that may have been created in the process of joining:

final_docs = list(filter(lambda x: x != '', final_docs))
print(final_docs)

Output:

[' Are you blue?', 'This is a great idea. I would love to go.', ' What is your name?', 'I am ready to go. Mom says hello.']

Here is the full code. Any suggestion is welcome; I am sure this can be solved with a more functional approach that may be easier to understand, but I am not familiar enough with that style.

import re
docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?", 
        'This is a great idea. I would love to go.', 
        'Hello, I am Jane Brown. What is your name?', 
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']

    
splitted_docs = []
for doc in docs:
    # split on periods only; '!' and '?' do not end a piece here
    splitted_docs.append(doc.split('.'))

final_docs = []
for sentences in splitted_docs:
    final_sentence = []
    for sentence in sentences:
        # keep only the pieces that contain neither greeting
        if 'Hello' not in sentence and 'Hi' not in sentence:
            final_sentence.append(sentence)
    final_docs.append('.'.join(final_sentence))


final_docs = list(filter(lambda x: x != '', final_docs))
print(final_docs)
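
The split-and-filter code above splits on periods only, so the two strings whose greeting ends in '!' are dropped entirely, and the kept remainders start with a space. As a hedged alternative sketch of the same split-then-check idea, the string can be split once at the whitespace that follows the first ., ? or !, and the first part discarded only when it starts with a greeting. The function name drop_greeting_sentence is hypothetical.

```python
import re

def drop_greeting_sentence(doc):
    """Drop the first sentence of doc when it begins with the word Hi or Hello."""
    # Split at most once, at whitespace that directly follows ., ? or !
    parts = re.split(r'(?<=[.?!])\s+', doc, maxsplit=1)
    first = parts[0]
    # Only drop the first sentence if something remains after it
    if len(parts) == 2 and re.match(r'(?:Hi|Hello)\b', first):
        return parts[1]
    return doc

docs = ['Hi, my name is Eric. Are you blue?',
        "Hi, I'm ! What is your name?",
        'This is a great idea. I would love to go.',
        'Hello, I am Jane Brown. What is your name?',
        "Hello, I am a doctor! Let's go to the mall.",
        'I am ready to go. Mom says hello.']

result = [drop_greeting_sentence(doc) for doc in docs]
print(result)
```

This keeps one output string per input string, so no filtering of empty strings is needed afterwards.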

CodePudding user response:

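import re
# anchor at the start, match Hi or Hello as a whole word,
# then the shortest run up to the first ., ? or ! and any whitespace after it
pattern = r'^(?:Hi|Hello)\b.*?[.?!]\s*'
for string in docs:
    mod_string = re.sub(pattern, '', string)
    print(mod_string)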