I'm trying to get the list output to not contain subgroups or empty strings. I'd like to stick with a regex-only solution because my re.split and array manipulation method is really janky and rather slow.
HTML file (notice that thing3 and thing4 are preceded by /b/ instead of /a/):
<!DOCTYPE html>
<html>
<head></head>
<body>
<a href="example.com/a/thing1"></a>
<a href="example.com/a/thing2"></a>
<a href="example.com/b/thing3"></a>
<a href="example.com/b/thing4" ><img src="/thing4.png"></a>
</body>
</html>
Python file:
import re
html = open("help.html", "r").read()
links = re.findall(r'((?<=\.com\/a\/).*(?="))|((?<=\.com\/b\/).*(?=" ><))|((?<=\.com\/b\/).*(?="><\/a))', html)
print(links)
What it outputs when I run the above Python file:
[('thing1', '', ''), ('thing2', '', ''), ('', '', 'thing3'), ('', 'thing4', '')]
What I want it to output:
['thing1', 'thing2', 'thing3', 'thing4']
CodePudding user response:
You just have to remove the capturing groups. As the documentation for re.findall states:
Empty matches are included in the result.
The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
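To make the quoted rule concrete, here is a small sketch on a made-up string (the sample text and the variable names are only for illustration, they are not from the question). It runs the same kind of pattern with no groups, one group, several groups, and non-capturing groups:

```python
import re

text = "a1 b2"  # made-up sample string, not from the question

# No groups: a list of the whole matches.
no_groups = re.findall(r"[ab]\d", text)

# Exactly one group: a list of what that group matched.
one_group = re.findall(r"[ab](\d)", text)

# Several groups: a list of tuples; a group that did not take part
# in a match shows up as an empty string.
many_groups = re.findall(r"(a\d)|(b\d)", text)

# Non-capturing groups (?:...) do not change the shape of the result.
non_capturing = re.findall(r"(?:a\d)|(?:b\d)", text)

print(no_groups)      # ['a1', 'b2']
print(one_group)      # ['1', '2']
print(many_groups)    # [('a1', ''), ('', 'b2')]
print(non_capturing)  # ['a1', 'b2']
```

The third case is the shape you are seeing: one tuple per match, with '' for every alternative that was not the one that matched.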
An example of a capturing group is ((?<=\.com\/a\/).*(?=")), so its outermost parentheses should be removed, and likewise for the other two groups:
links = re.findall(r'(?<=\.com\/a\/).*(?=")|(?<=\.com\/b\/).*(?=" ><)|(?<=\.com\/b\/).*(?="><\/a)', html)
Output:
['thing1', 'thing2', 'thing3', 'thing4']
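If you want to try this without help.html, the sketch below inlines the four links from the question as a string (the names html, pattern, and short_links are just for illustration) and applies the group-free pattern. The last part is a shorter pattern of my own, not part of the fix itself; it assumes every link sits under /a/ or /b/ and that the name ends at the closing double quote:

```python
import re

# The four links from the question, inlined so the example is self-contained.
html = '''<a href="example.com/a/thing1"></a>
<a href="example.com/a/thing2"></a>
<a href="example.com/b/thing3"></a>
<a href="example.com/b/thing4" ><img src="/thing4.png"></a>
'''

# The same three alternatives, without the outer capturing parentheses.
pattern = r'(?<=\.com\/a\/).*(?=")|(?<=\.com\/b\/).*(?=" ><)|(?<=\.com\/b\/).*(?="><\/a)'
links = re.findall(pattern, html)
print(links)  # ['thing1', 'thing2', 'thing3', 'thing4']

# Assumed simplification: one fixed-width lookbehind covering /a/ and /b/,
# then everything up to (but not including) the next double quote.
short_links = re.findall(r'(?<=\.com/[ab]/)[^"]+', html)
print(short_links)  # ['thing1', 'thing2', 'thing3', 'thing4']
```

The shorter form avoids the greedy .* and the three separate lookaheads, but it only holds as long as the href values keep the shape shown in the question.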