regex pattern matching incorrectly with python

Time:09-06

Within a string, I'm trying to match all characters up to the first comma, but I'm also getting matches that run past it, like these:

It takes hard, daft,

Not once did I stop and say, as I do now,

Below is my regex:

match = re.match(r".*,", temp)

example:

list = ['In the morning, frank crashed his car.', "Basically, he doesn't know how to drive."]

output_list = []

for i in list:
     match = re.match(r".*,", i)
     output_list.append(match.group())

I want to extract these two:

In the morning,

Basically,

CodePudding user response:

Match everything before the first comma:

^(.+?),

Example:

import re 

list = ['In the morning, frank crashed his car.', 'Basically, he doesn\'t know how to drive.']

output_list = []

for i in list:
     match = re.match(r"^(.+?),", i)
     output_list.append(match.group())
    
print(output_list)

Output:

['In the morning,', 'Basically,']

This website is great for learning regex: https://regex101.com/

CodePudding user response:

I am assuming you want to match anything before the first occurrence of a comma. If so, try matching your text against the regex [^,]* (note that this match stops before the comma, so the comma itself is not included). In Python it looks like this:

match = re.match(r"[^,]*", temp)

On top of that, maybe you will find this sandbox helpful for your trial and error: https://regexr.com/
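As a quick check of the negated character class, here is a minimal sketch. `[^,]*` stops before the comma, so the comma is not part of the match; appending a literal `,` to the pattern (my addition, not part of the answer above) includes it. The variable names are illustrative.

```python
import re

sentences = [
    "In the morning, frank crashed his car.",
    "Basically, he doesn't know how to drive.",
]

# [^,]* matches a run of non-comma characters, so it stops before the first comma.
without_comma = [re.match(r"[^,]*", s).group() for s in sentences]

# Adding a literal comma to the pattern keeps the comma in the match.
# re.match returns None when a string has no comma, so skip those strings.
with_comma = [m.group() for s in sentences if (m := re.match(r"[^,]*,", s))]

print(without_comma)
print(with_comma)
```

Because the character class can never consume a comma, this pattern cannot run past the first one, regardless of greediness.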

However, instead of using a regex, I'd suggest splitting each string on commas and taking the first element of the resulting list, e.g.

list = ['In the morning, frank crashed his car.', "Basically, he doesn't know how to drive."]

output_list = []

for i in list:
    output_list.append(i.split(',')[0])
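One thing to be aware of: `split(',')[0]` drops the comma, whereas the expected output in the question keeps it. A minimal sketch of one way around that, assuming the built-in `str.partition` method is acceptable here; the third sentence is a made-up example to show what happens when there is no comma.

```python
sentences = [
    "In the morning, frank crashed his car.",
    "Basically, he doesn't know how to drive.",
    "No comma here.",  # hypothetical input, not from the question
]

output_list = []
for sentence in sentences:
    # partition splits on the first comma only and returns (head, sep, tail);
    # sep is "," when a comma was found and "" otherwise.
    head, sep, _tail = sentence.partition(",")
    output_list.append(head + sep)

print(output_list)
```

With this approach a string without a comma is kept whole rather than skipped; whether that is the desired behaviour depends on your data.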

CodePudding user response:

You don't need to use regex for this situation, as you could use str.find() and then slice the string from the beginning of the string until the found position.

#!/usr/bin/env python3

sentences = [
    "In the morning, frank crashed his car.",
    "Basically, he doesn't know how to drive."]

output_list = []

for sentence in sentences:
    pos = sentence.find(",")
    if pos != -1:
        # since you also want the ',', slice up to pos + 1
        output_list.append(sentence[0:pos + 1])

print(output_list)

The output:

['In the morning,', 'Basically,']

If you do want to use re for this, you need to make the * in your regex non-greedy. By default it is greedy and will match as much as possible, as described in the re docs:

*?, +?, ??

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.
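To make the greedy versus non-greedy difference concrete, here is a small sketch that runs the example from the re docs and then applies the same idea to a sentence with two commas. The variable names are my own.

```python
import re

text = "<a> b <c>"

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.match(r"<.*>", text).group()

# Non-greedy: .*? grabs as little as possible, so the match ends at the first '>'.
lazy = re.match(r"<.*?>", text).group()

sentence = "In the morning, frank crashed his car, yep."

# The same effect with commas: greedy stops at the last comma, non-greedy at the first.
greedy_comma = re.match(r".*,", sentence).group()
lazy_comma = re.match(r".*?,", sentence).group()

print(greedy, "|", lazy)
print(greedy_comma, "|", lazy_comma)
```

This is why the original pattern `.*,` returned text up to the last comma in sentences that contain more than one.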

Something like this probably does what you want (untested):

#!/usr/bin/env python3

import re

sentences = [
    "In the morning, frank crashed his car, yep.",
    "Basically, he doesn't know how to drive."]

output_list = []

for sentence in sentences:
    if match := re.match(r".*?,", sentence):
        output_list.append(match[0])

print(output_list)