I am trying to scrape a list of trade publications from https://www.webwire.com/IndustryList.asp using BeautifulSoup and requests. When I inspect the page contents with my browser, I see a list:
<ul id="syndication-list">
<li>15 Minutes More Productions</li>
<li>AAA Go Magazine</li>
<li>AAA Going Places</li>
<li>AAA Motorist</li>
</ul>
But when I use requests, the list is empty, and I only see:
</ul></div>
How can I scrape the items in the list?
import requests
page = requests.get('https://www.webwire.com/TradePublications.asp?ind=LEI')
print(page.text)
CodePudding user response:
It works if you parse the response with BeautifulSoup and select the list items:
import requests
from bs4 import BeautifulSoup
url = "https://www.webwire.com/TradePublications.asp?ind=LEI"
page = requests.get(url)
#print(url)
soup = BeautifulSoup(page.content, 'html.parser')
# Select every <li> inside the element with id="syndication-list"
for e in soup.select('#syndication-list li'):
    print(e.get_text())
Output:
101 North Magazine (Gannett Pacific Publications)
15 Minutes More Productions
AAA Go Magazine
AAA Going Places
AAA Motorist
AAA World
AAHOA Lodging Business Magazine
Adfax
Admark Marketing Report
Adweek
African Americans on Wheels magzine
Agent@Home magazine
Air Transport World Magazine
Airguide Magazine & AirguideOnline.com
AIRS
Alaska Airlines Magazine
America West Magazine
American Executive magazine
American Express Publishing
American Fitness
American Media
American Profile
AMERICAN ROAD MAGAZINE
American Salon Magazine
American Saver Magzine
CodePudding user response:
The page you mentioned uses client-side rendering to create the list you are trying to scrape. If you send a plain HTTP request to the page, the response contains only the JavaScript code (or a link to it) that renders these elements; the browser is responsible for running it. If you want this content, there are two main options:
Run the page in a headless browser
You can run the page in a headless browser, which renders the data; once rendering has finished, you can scrape it. There are a few options for that, but for Python the most common one is Selenium.
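A minimal sketch of the headless browser approach, assuming Selenium 4 and a recent Chrome are installed. The function names `clean_texts` and `scrape_rendered_list` and the 10-second timeout are illustrative choices, not part of the original posts; the selector is the one used in the first answer.

```python
def clean_texts(texts):
    """Strip surrounding whitespace and drop blank entries."""
    return [t.strip() for t in texts if t and t.strip()]


def scrape_rendered_list(url, selector="#syndication-list li", timeout=10):
    """Load `url` in headless Chrome and return the text of matching elements."""
    # Imported lazily so clean_texts() is usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # assumption: a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one matching element is present in the rendered DOM.
        items = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
        )
        return clean_texts(item.text for item in items)
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()
```

Calling `scrape_rendered_list("https://www.webwire.com/TradePublications.asp?ind=LEI")` should then return the publication names as a list of strings, provided the list really is rendered into the DOM within the timeout.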
Tracing the requests
You can also look at the website's source code (or use the network tab in devtools) to determine how it populates the elements you are looking for (usually through some sort of API). You can then reproduce the requests the website makes and access the information directly.
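Replaying a traced request could look roughly like the sketch below, which uses only the standard library. The endpoint `https://example.com/api/publications`, the `ind` parameter, and the assumption that the response is JSON are all placeholders; the real values have to come from whatever the network tab shows for this site.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(endpoint, params=None, headers=None):
    """Build a GET request that mirrors one observed in the network tab."""
    url = f"{endpoint}?{urlencode(params)}" if params else endpoint
    return Request(url, headers=headers or {}, method="GET")


def fetch_json(request, timeout=10):
    """Send the request and decode the body, assuming the endpoint returns JSON."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and parameter: replace with what devtools actually shows.
req = build_request("https://example.com/api/publications", {"ind": "LEI"})
print(req.full_url)
```

Building the request separately from sending it makes it easy to compare the URL and headers with the ones the browser sent before calling `fetch_json(req)`.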
The headless browser option is easier and quicker to set up, but it is much slower and uses more resources. You need to assess which method better fits your use case.
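Before choosing, it may help to check whether the list items are already present in the raw response, since that decides whether a plain HTTP request is enough. The sketch below does this with the standard library's `html.parser`; `ListItemCollector` and `items_in_raw_html` are illustrative names, and the code assumes the list contains no nested `<ul>` elements.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text of <li> elements inside the <ul> with a given id."""

    def __init__(self, list_id):
        super().__init__()
        self.list_id = list_id
        self.in_list = False
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul" and dict(attrs).get("id") == self.list_id:
            self.in_list = True
        elif tag == "li" and self.in_list:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "ul":
            # Assumes no nested lists: any </ul> ends the target list.
            self.in_list = False
            self.in_item = False
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


def items_in_raw_html(html, list_id="syndication-list"):
    """Return the list items found in `html`; an empty result suggests client-side rendering."""
    parser = ListItemCollector(list_id)
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]


# Inline samples modelled on the snippets in the question.
sample = '<ul id="syndication-list"><li>AAA Go Magazine</li><li>AAA Motorist</li></ul>'
empty = '<ul id="syndication-list"></ul></div>'
print(items_in_raw_html(sample))
print(items_in_raw_html(empty))
```

Passing `page.text` from the question's `requests.get` call to `items_in_raw_html` would show which case applies: a non-empty result means the selector-based approach is sufficient, while an empty one points towards a headless browser or a traced request.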