Regex pattern matching incorrectly with Python

Within a string, I'm trying to match all characters up to the first comma, but I'm also getting matches like these:

It takes hard, daft,

Not once did I stop and say, as I do now,

Below is my regex:

match = re.match(r".*,", temp)

Example:

list = ['In the morning, frank crashed his car.', "Basically, he doesn't know how to drive."]

output_list = []

for i in list:
     match = re.match(r".*,", i)
     output_list.append(match.group())

I want to extract these two:

In the morning,

Basically,

CodePudding user response：

Match everything before the first comma:

^(.+?),

Example:

import re 

list = ['In the morning, frank crashed his car.', 'Basically, he doesn\'t know how to drive.']

output_list = []

for i in list:
     match = re.match(r"^(.+?),", i)
     output_list.append(match.group())
    
print(output_list)

Output:

['In the morning,', 'Basically,']

This website is great for learning regex: https://regex101.com/

CodePudding user response：

I am assuming you want to match everything before the first occurrence of a comma. If so, try matching your text against the regex [^,]*, which in Python looks as follows:

match = re.match(r"[^,]*", temp)
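
Note that [^,]* stops just before the comma, so the comma itself is not part of the match. Here is a minimal sketch of both variants, assuming the comma should be kept as in the question; the helper name up_to_first_comma is made up for illustration:

```python
import re

def up_to_first_comma(text, keep_comma=True):
    """Return the text before the first comma, optionally including the comma."""
    # [^,]* matches a run of non-comma characters; the trailing "," in the
    # first pattern consumes the comma itself, so a comma is then required.
    pattern = r"[^,]*," if keep_comma else r"[^,]*"
    match = re.match(pattern, text)
    return match.group() if match else None

print(up_to_first_comma("In the morning, frank crashed his car."))         # In the morning,
print(up_to_first_comma("In the morning, frank crashed his car.", False))  # In the morning
print(up_to_first_comma("No comma here"))                                  # None
```

Because the character class can never run past a comma, this works without a non-greedy quantifier.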

You may also find this sandbox helpful for experimenting: https://regexr.com/

However, instead of using a regex, I'd suggest splitting each string on commas and taking the first element of the resulting list, e.g.

list = ['In the morning, frank crashed his car.', "Basically, he doesn't know how to drive."]

output_list = []

for i in list:
    output_list.append(i.split(',')[0])
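
Splitting drops the comma itself. If the comma should stay, as in the question, str.partition is one alternative (swapped in here for illustration, not part of the suggestion above). A rough sketch:

```python
sentences = [
    "In the morning, frank crashed his car.",
    "Basically, he doesn't know how to drive.",
]

# split(",", 1) stops after the first comma, so the rest of the string
# is not split any further.
heads = [s.split(",", 1)[0] for s in sentences]
print(heads)  # ['In the morning', 'Basically']

# partition returns (before, separator, after); the separator is "" when
# no comma was found, so before + sep restores the comma only if it existed.
with_comma = [before + sep for before, sep, _ in (s.partition(",") for s in sentences)]
print(with_comma)  # ['In the morning,', 'Basically,']
```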

CodePudding user response：

You don't need to use regex for this situation, as you could use str.find() and then slice the string from the beginning of the string until the found position.

#!/usr/bin/env python3

sentences = [
    "In the morning, frank crashed his car.",
    "Basically, he doesn't know how to drive."]

output_list = []

for sentence in sentences:
    pos = sentence.find(",")
    if pos != -1:
        # since you also want the ',', slice up to pos + 1
        output_list.append(sentence[0:pos + 1])

print(output_list)

The output:

['In the morning,', 'Basically,']

If you want to use re for this, you have to change your regex to use a non-greedy match. The * quantifier is greedy by default and will try to match as much as possible, as described in the re docs:

*?, +?, ??

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.
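
The difference between greedy and non-greedy matching is easy to check directly. A small sketch, reusing the tag example from the re docs and one of the problem sentences from the question:

```python
import re

text = "<a> b <c>"

# Greedy: .* runs to the end, then backtracks only as far as the last ">".
greedy = re.match(r"<.*>", text).group()
# Non-greedy: .*? expands one character at a time and stops at the first ">".
lazy = re.match(r"<.*?>", text).group()
print(greedy)  # <a> b <c>
print(lazy)    # <a>

sentence = "Not once did I stop and say, as I do now,"
greedy_comma = re.match(r".*,", sentence).group()
lazy_comma = re.match(r".*?,", sentence).group()
print(greedy_comma)  # Not once did I stop and say, as I do now,
print(lazy_comma)    # Not once did I stop and say,
```

The greedy pattern ends at the last comma in the string, which is why the original regex returned more than the first clause.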

Something like this probably does what you want (untested):

#!/usr/bin/env python3

import re

sentences = [
    "In the morning, frank crashed his car, yep.",
    "Basically, he doesn't know how to drive."]

output_list = []

for sentence in sentences:
    if match := re.match(r".*?,", sentence):
        output_list.append(match[0])

print(output_list)