Why does this not work? Python webscraping


I use this code to get back all the text within the li tags, but it doesn't work.

from bs4 import BeautifulSoup
import requests
page = requests.get("https://archief.amsterdam/inventarissen/scans/31245/120.3")
soup = BeautifulSoup(page.content, 'html.parser')
result = soup.find_all('#modal > div > div.content > div > div > ul > li:nth-child(1) > span.file-name')

for i in range(len(result)):
    print(result[i].text.strip())

print(len(result))

[Image of the website I want the data from]

CodePudding user response:

It looks like the site is creating those tags using JavaScript, and the requests module doesn't run JS at all, so the tags never appear in page.content.

You could use something like requests-html or Selenium to let the JS run before you access the content, or you could scrape the data the page loads directly. (I checked: the page makes a request to a server that returns the data you need in JSON format. Look at the Network tab of your browser's Developer Tools while the page loads if you want to go that route.)
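
For example, here's a minimal Selenium sketch (assuming Selenium 4 with a Chrome driver available on the PATH; the 10-second wait and the li span.file-name selector are just illustrative, not tested against this exact page):

# Minimal sketch: let the page's JavaScript run before reading the elements.
# Assumes Selenium 4 and a Chrome driver that Selenium can find on the PATH.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://archief.amsterdam/inventarissen/scans/31245/120.3")

# Wait (up to 10 seconds) until the JS-generated spans actually exist.
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li span.file-name"))
)

for element in driver.find_elements(By.CSS_SELECTOR, "li span.file-name"):
    print(element.text.strip())

driver.quit()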

Also,

  • You could simplify your selector to li span.file-name, assuming you want to get every filename. Note that CSS selectors go through soup.select(), not find_all(), which only matches tag names; see the sketch after these bullets.
  • Python supports for-loops like this: for result in results, so you could use that instead of the more traditional/JavaScript-y variety. I'll put an example below.
# This is assuming the "result" variable is renamed to "results".
for result in results:
    print(result.text.strip())

print(len(results))
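
Putting those two points together, here's a sketch of the simplified selector with soup.select(). The HTML snippet and filenames below are just placeholders standing in for the page content after the JavaScript has run (e.g. Selenium's driver.page_source):

from bs4 import BeautifulSoup

# Placeholder HTML standing in for the rendered page.
rendered_html = """
<ul>
  <li><span class="file-name">example-scan-1.jpg</span></li>
  <li><span class="file-name">example-scan-2.jpg</span></li>
</ul>
"""

soup = BeautifulSoup(rendered_html, "html.parser")
# select() accepts CSS selectors; find_all() does not.
results = soup.select("li span.file-name")

for result in results:
    print(result.text.strip())

print(len(results))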

Data Scraping Method (response to comment)

  1. Replace the webpage's URL in your call to requests.get with the API URL from the Network tab.
  2. Convert the JSONP text the server returns into regular JSON, so it can be parsed with Python's standard json library.
  3. Iterate through the parsed JSON, pulling out the value of "name" for each scan and appending it to a list.

Full example:

import json
import requests

# The URL from the network tab.
api_url = "https://webservices.picturae.com/archives/scans/31245/120.3?apiKey=eb37e65a-eb47-11e9-b95c-60f81db16c0e&lang=nl_NL&findingAid=31245&path=120.3&callback=callback_json5"
response = requests.get(api_url)
# The split() and strip() calls remove the JSONP callback wrapper from the
# response, leaving just the JSON data; stripping ");" also covers an
# optional trailing semicolon after the closing parenthesis.
raw_json = response.text.split("(", 1)[1].strip(");")
# Load the JSON data into a regular Python dictionary.
data = json.loads(raw_json)
# Add all the filenames from the data into the filenames list.
filenames = []
for scan in data["scans"]["scans"]:
    filename = scan["name"]
    print(filename)
    filenames.append(filename)

print("\nFilename count:", len(filenames))