I use this code to get all the text within the li tags, but it doesn't work.
from bs4 import BeautifulSoup
import requests
page = requests.get("https://archief.amsterdam/inventarissen/scans/31245/120.3")
soup = BeautifulSoup(page.content, 'html.parser')
result = soup.find_all('#modal > div > div.content > div > div > ul > li:nth-child(1) > span.file-name')
for i in range(len(result)):
    print(result[i].text.strip())
print(len(result))
CodePudding user response:
It looks like the site is creating those tags using JavaScript, and the requests module doesn't run JS at all, so the tags never appear in page.content.
You could use something like requests-html or Selenium to let the JS run before you access the content, or scrape the data the page loads directly. (I checked, and there's a request made to the server that returns the data you need in JSON format. Watch the Network tab of your browser's Developer Tools while the page loads if you want to use that approach.)
Also:

- You could simplify your selector to li span.file-name, assuming you want to get every filename.
- Python supports for-loops like for result in results, so you could use that instead of the more traditional/JavaScript-y index-based variety. I'll put an example below.
# This is assuming the "result" variable is renamed to "results".
for result in results:
    print(result.text.strip())
print(len(results))
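One caveat on the simplified selector: find_all() matches tag names and attributes, not CSS selector strings, so a selector like li span.file-name has to go through select() instead. A minimal sketch against a static snippet (the HTML below is made up to mimic the page's file list):

```python
from bs4 import BeautifulSoup

# A made-up snippet shaped like the page's rendered file list.
html = """
<ul>
  <li><span class="file-name">A01234_000001.jpg</span></li>
  <li><span class="file-name">A01234_000002.jpg</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# select() understands CSS selectors; find_all() does not.
names = [span.text.strip() for span in soup.select("li span.file-name")]
print(names)  # → ['A01234_000001.jpg', 'A01234_000002.jpg']
```

Of course, this only helps once the tags actually exist in the HTML you're parsing, which is the real problem here.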
Data Scraping Method (response to comment)

- Replace the webpage's URL in your call to requests.get with the API's.
- Convert the JSONP text returned by the server to regular JSON, so that we can parse it using Python's standard json library.
- Iterate through the parsed JSON, pulling out the value of "name" and adding it to some list.
Full example:
import json
import requests
# The URL from the network tab.
api_url = "https://webservices.picturae.com/archives/scans/31245/120.3?apiKey=eb37e65a-eb47-11e9-b95c-60f81db16c0e&lang=nl_NL&findingAid=31245&path=120.3&callback=callback_json5"
response = requests.get(api_url)
# The split() and strip() calls here remove parts of the request
# that are JSONP, not JSON. We need just the JSON data.
raw_json = response.text.split("(", 1)[1].strip(")")
# Load the JSON data into a regular Python dictionary.
data = json.loads(raw_json)
# Add all the filenames from the data into the filenames list.
filenames = []
for scan in data["scans"]["scans"]:
    filename = scan["name"]
    print(filename)
    filenames.append(filename)
print("\nFilename count:", len(filenames))
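For reference, the split()/strip() conversion works because JSONP is just JSON wrapped in a function call, e.g. callback_json5({...}). Here's a self-contained demonstration with a made-up payload (the callback name matches the API's, but the data is invented), so you can see the unwrapping step on its own:

```python
import json

# A made-up JSONP response, shaped like what the API returns.
jsonp = 'callback_json5({"scans": {"scans": [{"name": "A01234_000001.jpg"}]}})'

# Drop everything up to the first "(" and the trailing ")",
# leaving only the JSON payload inside the function call.
raw_json = jsonp.split("(", 1)[1].strip(")")

data = json.loads(raw_json)
print(data["scans"]["scans"][0]["name"])  # → A01234_000001.jpg
```

If the server ever appends a trailing semicolon after the closing parenthesis, a rstrip(");") would be the safer choice, but the response here ends with just ")".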