Using BeautifulSoup to get text and image urls from HTML string

Time:05-19

import json
import re
import urllib.request

from bs4 import BeautifulSoup


class BlogApi(object):
    def __init__(self):
        # Keep the URL in its own variable so the json module is not shadowed.
        rotmgjson = "https://remaster.realmofthemadgod.com/index.php?rest_route=/wp/v2/posts/"
        with urllib.request.urlopen(rotmgjson) as url:
            self.post_json = json.loads(url.read().decode())

    async def content(self, thread=0, parse=True):
        """Returns content of blog post as string.
        Thread is 0 (latest) by default.
        Parse is True by default."""
        dirty_content = self.post_json[thread]['content']['rendered']
        if not parse:
            return dirty_content
        soup = BeautifulSoup(dirty_content, features="html.parser")
        # Collect the src URL of every .png image (the dot is escaped so it matches literally).
        images = [img.get('src') for img in soup.find_all('img', {'src': re.compile(r'\.png')})]
        return images, soup.text

I'm using the above class to get all text and image URLs from an HTML string. The full string looks like this: https://controlc.com/c3cdf2ef.

My problem is that the image URLs are returned separately from the text. My goal is to have them at the same position in the text as they appear on the webpage. So, for example, my returned string should look like this:

https://remaster.realmofthemadgod.com/wp-content/uploads/2022/05/steam_forgerenovation.png
Realmers,
The Forge is about to change. Coming in June the blacksmith will present the Heroes with her renovated forge. She’s equipped it with better and more reliable equipment, capable of making items no one thought could be made that way. 
Here’s what’s going to change:
https://remaster.realmofthemadgod.com/wp-content/uploads/2022/04/c5c640b1-a033-4547-aabb-5af37a8ce4c5-1024x616.png ...

It's actually way longer with more images. But yeah.

CodePudding user response:

You could replace each <img src="my_image.png"/> element with its src text, like this:

for image in (images := soup.find_all('img', {'src':re.compile(r'\.png')})):
    image.replace_with(image.get('src'))

After this, soup.text contains the image URLs inline, at the positions where the <img> tags used to be. It's more of a pragmatic solution than anything fancy, let alone a recommended approach.
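
To make the replacement idea concrete, here is a minimal end-to-end sketch. The SAMPLE_HTML string, the example.com URLs, and the text_with_image_urls helper are hypothetical stand-ins, not part of the original class or the real blog post. It also assumes that get_text with a newline separator is acceptable for putting each URL on its own line.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the blog post's rendered HTML.
SAMPLE_HTML = (
    '<p><img src="https://example.com/a.png"></p>'
    '<p>Realmers,</p>'
    '<p>Here is what is going to change:</p>'
    '<p><img src="https://example.com/b.png"></p>'
)


def text_with_image_urls(html):
    """Return the text of `html` with each .png <img> replaced by its src URL."""
    soup = BeautifulSoup(html, features="html.parser")
    for image in soup.find_all("img", src=re.compile(r"\.png")):
        # replace_with swaps the tag for a plain string at the same position.
        image.replace_with(image["src"])
    # A newline separator keeps each URL and each paragraph on its own line.
    return soup.get_text(separator="\n", strip=True)


result = text_with_image_urls(SAMPLE_HTML)
print(result)
```

Note that the separator is inserted between every text node, so text split by inline tags such as <strong> would also end up on separate lines. If that matters, plain soup.text may be the better fit.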
