re.findall outputs blanks along with correct

Time:12-08

I'm trying to get the list output without subgroups or empty strings. I'd like to stick with a regex-only solution because my re.split and array-manipulation approach is janky and somewhat slow.

HTML file (note that thing3 and thing4 are preceded by /b/ instead of /a/):

<!DOCTYPE html>
<html>
    <head></head>   
    <body>
        <a href="example.com/a/thing1"></a>
        <a href="example.com/a/thing2"></a>
        <a href="example.com/b/thing3"></a>
        <a href="example.com/b/thing4" ><img src="/thing4.png"></a>
    </body>
</html>

Python file:

import re

# Read the whole file; the with-block closes it afterwards.
with open("help.html", "r") as f:
    html = f.read()
links = re.findall(r'((?<=\.com\/a\/).*(?="))|((?<=\.com\/b\/).*(?=" ><))|((?<=\.com\/b\/).*(?="><\/a))', html)

print(links)

What it outputs when I run the above file:

[('thing1', '', ''), ('thing2', '', ''), ('', '', 'thing3'), ('', 'thing4', '')]

What I want it to output:

[thing1, thing2, thing3, thing4]

Answer:

You just have to remove the capturing groups. As stated in the documentation for re.findall:

Empty matches are included in the result.

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
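
To make the quoted rule concrete, here is a minimal sketch on a made-up string (the sample text and the variable names are illustrative assumptions, not taken from the question). It shows how the same kind of alternation returns plain strings, or tuples padded with empty strings, depending only on how it is grouped:

```python
import re

# Hypothetical sample text, chosen only to illustrate the rule.
text = "a1 b2 a3"

# No groups: a list of whole-match strings.
no_groups = re.findall(r"a\d|b\d", text)

# Exactly one capturing group: a list of that group's strings.
one_group = re.findall(r"[ab](\d)", text)

# Several capturing groups: a list of tuples, with '' for every
# group that did not take part in a given match.
many_groups = re.findall(r"(a\d)|(b\d)", text)

# Non-capturing groups (?:...) leave the result shape unchanged.
non_capturing = re.findall(r"(?:a\d)|(?:b\d)", text)

print(no_groups)      # ['a1', 'b2', 'a3']
print(one_group)      # ['1', '2', '3']
print(many_groups)    # [('a1', ''), ('', 'b2'), ('a3', '')]
print(non_capturing)  # ['a1', 'b2', 'a3']
```

The third call reproduces the shape seen in the question: one tuple per match, with an empty string in the slot of each alternative that did not match.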

Each alternative in your pattern is wrapped in a capturing group, for example ((?<=\.com\/a\/).*(?=")). Remove the outermost parentheses from that group and from the other two:

links = re.findall(r'(?<=\.com\/a\/).*(?=")|(?<=\.com\/b\/).*(?=" ><)|(?<=\.com\/b\/).*(?="><\/a)', html)

Output:

['thing1', 'thing2', 'thing3', 'thing4']
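
For a quick check without help.html, the sketch below embeds the question's four links in a string (the html variable here is a stand-in for the file contents) and runs the pattern with the outer parentheses removed. Splitting the pattern over several raw strings and dropping the backslashes before / are tidy-ups of mine, on the assumption that / needs no escaping in Python's re; they should not change what the pattern matches.

```python
import re

# Stand-in for the contents of help.html: the four links from the question.
html = '''
        <a href="example.com/a/thing1"></a>
        <a href="example.com/a/thing2"></a>
        <a href="example.com/b/thing3"></a>
        <a href="example.com/b/thing4" ><img src="/thing4.png"></a>
'''

# Same three alternatives as above, with no capturing groups.
# Adjacent raw strings are concatenated into one pattern.
pattern = (
    r'(?<=\.com/a/).*(?=")'
    r'|(?<=\.com/b/).*(?=" ><)'
    r'|(?<=\.com/b/).*(?="></a)'
)

links = re.findall(pattern, html)
print(links)  # ['thing1', 'thing2', 'thing3', 'thing4']
```

Because the pattern contains no capturing groups, findall returns the whole match for each link, so there are no tuples and no empty strings to filter out.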