I'm trying to extract image URLs from this code:
<div data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" style='background-image: url("https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg");'></div>
How can I find the URLs in data-src?
I'm using Beautiful Soup and its find function, but I don't know how to extract the links because there is no img tag here as there usually is...
Thank you for your time in advance
CodePudding user response:
If you can't use an HTML parser for whatever reason, then you can use regex.
import re
text = '''
<div data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" style='background-image: url("https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg");'></div>
'''
# [^"]* stops at the attribute's closing quote, so the match cannot run into later attributes
parsed = re.search(r'(?<=data-src=")[^"]*', text).group(0)
print(parsed)
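If the only obstacle is installing third-party packages, the standard library's html.parser module may be a sturdier option than a regex. The following is a minimal sketch, not a definitive implementation: the class name DataSrcCollector and the second example.com URL are hypothetical and only there for illustration.

```python
from html.parser import HTMLParser


class DataSrcCollector(HTMLParser):
    """Collect the data-src value of every tag that carries one."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == "data-src" and value:
                self.urls.append(value)


text = (
    '<div data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" '
    'data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg"></div>'
    # Hypothetical second element, added only to show that several values are collected
    '<div data-src="https://example.com/second.jpg"></div>'
)

collector = DataSrcCollector()
collector.feed(text)
print(collector.urls)
```

Because the parser compares the attribute name exactly, data-featured-src is ignored without any extra pattern work.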
CodePudding user response:
You can try the following:
from bs4 import BeautifulSoup
html = """
<div data-featured-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg" style='background-image: url("https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg");'></div>
"""
soup = BeautifulSoup(html, "html.parser")
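# The sample <div> has no class attribute, so a class-based selector would return None.
# Select on the attribute itself instead; on the real page a class can be added to narrow the match.
url = soup.select_one("div[data-src]").get("data-src")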
print(url)
This will return:
https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg
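Since the question asks about URLs in the plural, here is a hedged sketch of how the same idea might extend to a page with several such elements. The second and third div elements are hypothetical and only there to show the filtering; find_all with attrs={"data-src": True} matches any tag that has the attribute, whatever its value.

```python
from bs4 import BeautifulSoup

html = """
<div data-src="https://websitedemos.net/wp-content/uploads/2019/07/outdoor-adventure-02-home.jpg"></div>
<div data-src="https://example.com/second.jpg"></div>
<div class="no-image"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only tags that actually carry a data-src attribute, then read its value
urls = [tag["data-src"] for tag in soup.find_all(attrs={"data-src": True})]
print(urls)
```

The div without data-src is skipped, so there is no need to check for None before reading the attribute.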
Documentation for BeautifulSoup(bs4) can be found at: