Adding line breaks after times in parentheses

Time:03-29

I'm trying to clean up some data from web scraping.

This is an example of the information I'm working with:

Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)

And this is an example of what I'm trying to achieve:

Best Time
Adam Jones (w/ help) (6:34)

Best Time
Kenny Gobbin (a) (2:38)

Personal Best
Matt Herrera (12:44)

No-record
Nick Elizabeth (19:04)

I want to add two newlines after the closing parenthesis of each time, but since the times are all different, I don't know how to search for and replace them. Also, numbers may sometimes occur outside of the times.

The closest I've come is searching for numbers inside parentheses separated by a colon, but I don't know how to keep the matched time in the replacement.

re.sub(r"\([0-9]+:[0-9]+\)", "\n\n", result)

Does anyone know how I can achieve this?

CodePudding user response:

You can do it your way with a minimal change. You only have to know about grouping and add \g<0> right before \n\n in the replacement, so that the matched time is inserted again. You can read about it in the official documentation in the section about search-and-replace.

re.sub(r"\([0-9]+:[0-9]+\)", r"\g<0>\n\n", result)

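Here I used group 0, which is the entire match, to insert the matched time again. Each unescaped set of () in a pattern is a capture group, numbered from left to right starting at 1; group 0 always refers to the whole match. The parentheses in this pattern are escaped with backslashes, so they match literal characters and do not create groups.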
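As a rough sketch of this approach on the sample data from the question (the name time_pattern and the final rstrip are my own additions, not part of the answer above):

```python
import re

data = """Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)"""

# The escaped parentheses match literal characters, so this pattern
# defines no capture groups; only group 0 (the whole match) exists.
time_pattern = re.compile(r"\([0-9]+:[0-9]+\)")

# \g<0> re-inserts the whole match; re.sub converts \n in the
# replacement template into a newline character.
result = time_pattern.sub(r"\g<0>\n\n", data)

# The last time also gets two newlines appended, so strip them if unwanted.
result = result.rstrip("\n")
print(result)
```

Because the pattern requires digits, a colon, and digits inside the parentheses, (w/ help) and (a) are left alone.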

CodePudding user response:

Notice that the place where you need to insert two newlines comes between an end parenthesis and an alphabetic character. So, you can use:

re.sub(r"\)([A-Za-z])", r")\n\n\1", data)

For example:

import re
data = """Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)"""

result = re.sub(r"\)([A-Za-z])", r")\n\n\1", data)
print(result)

outputs:

Best Time
Adam Jones (w/ help) (6:34)

Best Time
Kenny Gobbin (a) (2:38)

Personal Best
Matt Herrera (12:44)

No-record
Nick Elizabeth (19:04)

Here's an explanation for how it works:

For the expression we're trying to match, we have r"\)([A-Za-z])":

  • \) matches a literal end parenthesis.
  • [A-Za-z] matches a single alphabetic character.
  • Enclosing [A-Za-z] in parentheses makes it a capture group that we refer to later.

For the replacement expression, we have r")\n\n\1":

  • )\n\n adds an end parenthesis plus two newlines.
  • \1 refers to the capture group from earlier. Intuitively, we capture the alphabetic character immediately after the end parenthesis, and then add that same character back into the replacement expression.
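One caveat: the question says numbers may occur outside of the times, and \)([A-Za-z]) will not fire if the text after a time starts with a digit. A possible variation, offered only as a sketch, is to match the time itself and use a lookahead for any non-whitespace character. The "2nd Place" label and the function name split_after_times below are invented for illustration.

```python
import re

# Hypothetical sample: "2nd Place" is an invented label that starts with a digit.
sample = """Best Time
Adam Jones (w/ help) (6:34)2nd Place
Kenny Gobbin (a) (2:38)No-record
Nick Elizabeth (19:04)"""


def split_after_times(text):
    # Capture the time in group 1 and require a non-whitespace character
    # immediately after it (lookahead, not consumed), so a time at the end
    # of a line or at the end of the text is left untouched.
    return re.sub(r"(\([0-9]+:[0-9]+\))(?=\S)", r"\1\n\n", text)


print(split_after_times(sample))
```

Since the lookahead consumes nothing, the replacement only needs \1, and no trailing newlines are added after the final time.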