I'm trying to clean up some data from web scraping.
This is an example of the information I'm working with:
Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)
And this is an example of what I'm trying to achieve:
Best Time
Adam Jones (w/ help) (6:34)

Best Time
Kenny Gobbin (2:38)

Personal Best
Matt Herrera (12:44)

No-record
Nick Elizabeth (19:04)
I want to add two newlines after the right parenthesis that closes each time, but as the times are all different, I don't know how to search for and replace them. Also, numbers may sometimes occur outside of the times.
The closest I've come is searching for numbers inside parentheses separated by a colon, but I don't know how to keep the matched text in the replacement.
re.sub(r"\([0-9]+:[0-9]+\)", "\n\n", result)
Does anyone know how I can achieve this?
CodePudding user response:
You can do it your way with a minimal change. You only need to know about grouping and add \g<0> right before \n\n. You can read about it in the official documentation, in the section about search-and-replace.
re.sub(r"\([0-9]+:[0-9]+\)", r"\g<0>\n\n", result)
Here I used group 0, which is the entire match, to insert the matched text again. The replacement is a raw string so that Python passes \g<0> through to re unchanged. Each set of () in the pattern is a capture group, numbered from left to right starting at 1; group 0 always refers to the whole match.
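As a quick check, here is a minimal sketch of this approach run against the sample data from the question. It assumes every time looks like digits, a colon, then digits inside parentheses; the variable names data and result are only for illustration.

```python
import re

data = """Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)"""

# \g<0> re-inserts the whole match (the parenthesised time),
# and the two \n escapes add the newlines right after it.
result = re.sub(r"\([0-9]+:[0-9]+\)", r"\g<0>\n\n", data)

# The final time also gets two newlines appended, so trim the end.
result = result.rstrip()
print(result)
```

Parentheses such as (w/ help) and (a) are left alone because they do not contain the digits-colon-digits shape.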
CodePudding user response:
Notice that the place where you need to insert two newlines comes between an end parenthesis and an alphabetic character. So, you can use:
re.sub(r"\)([A-Za-z])", r")\n\n\1", data)
For example:
import re
data = """Best Time
Adam Jones (w/ help) (6:34)Best Time
Kenny Gobbin (a) (2:38)Personal Best
Matt Herrera (12:44)No-record
Nick Elizabeth (19:04)"""
result = re.sub(r"\)([A-Za-z])", r")\n\n\1", data)
print(result)
outputs:
Best Time
Adam Jones (w/ help) (6:34)

Best Time
Kenny Gobbin (a) (2:38)

Personal Best
Matt Herrera (12:44)

No-record
Nick Elizabeth (19:04)
Here's an explanation for how it works:
For the expression we're trying to match, we have r"\)([A-Za-z])":
- \) matches a literal end parenthesis.
- [A-Za-z] matches a single alphabetic character.
- Enclosing [A-Za-z] in parentheses makes it a capture group that we refer to later.
For the replacement expression, we have r")\n\n\1":
- )\n\n adds an end parenthesis plus two new lines.
- \1 refers to the capture group from earlier. Intuitively, we capture the alphabetic character immediately after the end parenthesis, and then add that same character back into the replacement expression.
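To make the capture-and-reinsert step more concrete, here is a small sketch that lists what group 1 captures and then performs the same substitution with a replacement function instead of a template. The helper name add_blank_line and the shortened sample string are only for illustration.

```python
import re

data = "Adam Jones (w/ help) (6:34)Best Time\nKenny Gobbin (a) (2:38)Personal Best"

pattern = re.compile(r"\)([A-Za-z])")

# Each match is a ")" followed directly by a letter; group(1) is that letter.
captured = [m.group(1) for m in pattern.finditer(data)]
print(captured)  # ['B', 'P']

# A replacement function that does the same job as the template r")\n\n\1".
def add_blank_line(match):
    return ")\n\n" + match.group(1)

via_function = pattern.sub(add_blank_line, data)
via_template = pattern.sub(r")\n\n\1", data)
print(via_function == via_template)  # True
```

The parentheses around (w/ help) and (a) are followed by a space rather than a letter, so they never match and stay untouched.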