How can I get information from a web site using BeautifulSoup in python?

Time:01-08

I need to extract the publication date displayed on the following web page with BeautifulSoup in Python:

https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410

The problem is that when I search the HTML shown by the browser's 'inspect' tool, I find the publication date quickly, but when I search the HTML retrieved with Python, I cannot find it, even with find() and find_all().

I tried this code:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content)

soup.find_all('span', id_= 'biblio-publication-number-content')

but it returns [], even though this tag is present in the 'inspect' view of the live page.

Why is the HTML shown by 'inspect' different from the HTML I get with BeautifulSoup?

How can I solve this issue and get the publication number and date?

CodePudding user response:

I believe the problem is that the content you are looking for is loaded by JavaScript after the initial page has loaded. requests only returns the initial page content, as it was before JavaScript modified the DOM.
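
One way to confirm this diagnosis is to check whether the id appears at all in the HTML you actually received. The sketch below uses only the standard library's html.parser and runs on two made-up HTML strings, not on the real Espacenet page. IdFinder, has_id, static_html and rendered_html are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records whether any start tag carries the wanted id attribute."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if dict(attrs).get("id") == self.wanted_id:
            self.found = True


def has_id(html, wanted_id):
    """Return True if the HTML text contains an element with wanted_id."""
    finder = IdFinder(wanted_id)
    finder.feed(html)
    return finder.found


# Hypothetical stand-ins: an empty shell such as requests might receive,
# and the same page after JavaScript has filled it in.
static_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
rendered_html = '<html><body><div id="root"><span id="biblio-publication-number-content">x</span></div></body></html>'

print(has_id(static_html, "biblio-publication-number-content"))
print(has_id(rendered_html, "biblio-publication-number-content"))
```

If the same check on r.text from requests reports that the id is missing while the browser shows it, the element is most likely being created client-side.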

For this you could install selenium and then download a Selenium web driver for your specific browser. Put the driver in a directory that is on your PATH and then run the following (here I am using Chrome):

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)

try:

    driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')

    # Wait (for up to 10 seconds) for the element we want to appear:
    driver.implicitly_wait(10)
    elem = driver.find_element(By.ID, 'biblio-publication-number-content')

    # Now we can use soup:
    soup = bs(driver.page_source, "html.parser")
    print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
    driver.quit()

Prints:

<span id="biblio-publication-number-content"><span >CN105030410</span>A·2015-11-11</span>
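
Since the question asks for the publication date, here is a minimal sketch of splitting that text into its two parts. It assumes the span's text has the form shown above, with the number and the date separated by a middle dot ("·"). parse_publication is a hypothetical helper name, not part of BeautifulSoup or Selenium.

```python
from datetime import datetime


def parse_publication(text):
    """Split 'NUMBER·YYYY-MM-DD' into (number, date)."""
    # partition() splits on the first middle dot only.
    number, _, date_part = text.partition("·")
    published = datetime.strptime(date_part.strip(), "%Y-%m-%d").date()
    return number.strip(), published


# Text of the span printed above, e.g. as returned by get_text().
number, published = parse_publication("CN105030410A·2015-11-11")
print(number, published)
```

In the Selenium example, the input string could come from calling get_text() on the span that soup.find() returns. strptime raises ValueError if the page ever uses a different date format, which makes such a change easy to notice.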

CodePudding user response:

Umberto, if you are looking for HTML span elements, use the following code:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')

results = soup.find_all('span')
[r for r in results]

If you are looking for an HTML element with the id 'biblio-publication-number-content', use the following code:


import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')

soup.find_all(id='biblio-publication-number-content')

In the first case you fetch all span elements; in the second case you fetch all elements with the id 'biblio-publication-number-content'.

I suggest you read up on HTML tags and elements for a deeper understanding of how they work and the semantics behind them.
