Why is the page source different between Selenium and BeautifulSoup?


As the title says, I am crawling data from a Vietnamese website (https://webgia.com/lai-suat/). At first I used BeautifulSoup, but it does not return the data that the HTML source shows in Chrome; the numbers are hidden. However, when I switched to Selenium to get the HTML source, it returned the ideal result, with all the numbers shown.

The code is as below:

Using bs4:

import requests
from bs4 import BeautifulSoup
url = "https://webgia.com/lai-suat/"
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
table = soup.find_all('table', attrs={'class': 'table table-radius table-hover text-center'})
table_body = table[0].find('tbody')

rows = table_body.find_all('tr')

for row in rows:
    cols = row.find_all('td')
    for col in cols:
        print(col)

The data is hidden; the result is:

<td ><a  href="https://webgia.com/lai-suat/abbank/" title="Lãi suất ABBank - Ngân hàng TMCP An Bình"><span ></span><span>ABBank</span></a></td>
<td  nb="E3c7370616e20636c617C37B33d2B2746578742d6772H65I656e223e3A02c32303c2f7370616e3Ie"><small>web giá</small></td>
<td  nb="R3ZJ3YKJ2c3F635D"><small>xem tại webgia.com</small></td>
<td  nb="3c7370616e20636Fc61C73733d22746578742dC6772A65656e223e3S42cT303N03c2f7370616e3e"><small>webgia.com</small></td>
<td  nb="352cMA3Z6BE30"><small>web giá</small></td>
<td  nb="352cLXG3A7I30"><small>web giá</small></td>

But if I get the HTML source using Selenium and then run the same code above:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

s = Service(executable_path=ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.maximize_window()
url = "https://webgia.com/lai-suat/"
driver.get(url)

soup = BeautifulSoup(driver.page_source, 'lxml')
...

The result shows all the numbers:

<td ><span >0,20</span></td>
<td >3,65</td>
<td ><span >4,00</span></td>
<td >5,60</td>
<td >5,70</td>
<td >5,70</td>
...

So can anyone explain why there is this difference? And how can I get the same result using just bs4 instead of Selenium? Thank you, guys.

CodePudding user response:

The difference is because most websites today ship not only HTML but also JS scripts capable of modifying the HTML when executed. To execute those scripts a JS engine is required, and that is exactly what web browsers provide you with: a JS engine (V8 in Chrome's case).

  • HTML contents fetched using BeautifulSoup are the "raw" ones, unmodified by any JS scripts, because there is no JS engine to execute them in the first place. It is those JS scripts that are in charge of fetching the data and updating the HTML with it
  • HTML contents provided by Selenium, on the other hand, are the ones left after the JS scripts have been executed. Selenium can do this not because it executes JS itself, but because it drives an external browser through a webdriver, and that browser executes the scripts for you

Since you'll eventually need a JS engine to execute the JS scripts, I don't think BeautifulSoup alone can cut it.
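
To see the difference concretely, here is a minimal sketch (assuming the same webdriver-manager setup as in the question) that fetches the page both ways and prints the first data cell from each:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

url = "https://webgia.com/lai-suat/"

# Raw HTML as served: no JS engine has run, so the obfuscated "nb" attributes remain
raw = BeautifulSoup(requests.get(url).text, "lxml")

# Rendered HTML: Chrome's JS engine has already rewritten the cells
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
rendered = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

cells = "table.table-radius tbody tr td"
print("raw:     ", raw.select(cells)[1])       # e.g. <td nb="..."><small>web giá</small></td>
print("rendered:", rendered.select(cells)[1])  # e.g. <td><span>0,20</span></td>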

CodePudding user response:

Expanding on the above answer, and generally speaking:

In order to tell whether specific data is fetched/generated by JS or returned with the page HTML, you can use a Chrome DevTools feature that blocks JS execution (click Inspect, then press F1 to find the setting). If you keep DevTools open when you visit the page and the data is there, that is a clear indication the data comes with the HTML; if it is not, then it is either fetched or generated by JS.

If the data is fetched, then simply by inspecting the network requests your browser makes while you visit the website, you should see the call that fetches the data, and you should be able to replicate it using the requests module, as sketched below.
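
For example, if the Network tab showed the numbers coming from a JSON endpoint, you could replicate that call directly. The endpoint and headers below are hypothetical, purely to illustrate the pattern:

import requests

# Hypothetical endpoint: replace with whatever request actually appears
# in the Network tab for your target site
api_url = "https://example.com/api/rates"

resp = requests.get(
    api_url,
    headers={
        "User-Agent": "Mozilla/5.0",
        # Copy across any headers the browser sent that the server requires,
        # e.g. Referer or X-Requested-With
        "Referer": "https://example.com/",
    },
)
resp.raise_for_status()
print(resp.json())  # many such endpoints return JSON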

If not, then you have to reverse engineer the JS: set an on-page-load breakpoint and refresh the page, and JS execution will stop as the page loads. By right-clicking the element the data is set on, you can choose "Break on subtree modifications" or "Break on attribute modifications". After removing the on-page-load breakpoint and refreshing the page, Chrome will now break on the JS code responsible for generating the data.

CodePudding user response:

The page has obfuscated that content and placed it inside an nb attribute of the appropriate tds. When JavaScript runs in the browser, the following script content runs, converting the obfuscated data into what you see on the page:

function gm(r) {
    r = r.replace(/A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z/g, "");
    for (var n = [], t = 0; t < r.length - 1; t += 2) n.push(parseInt(r.substr(t, 2), 16));
    return String.fromCharCode.apply(String, n)
}
$(document).ready(function() {
    $("td.blstg").each(function() {
        var gtls = $(this).attr("nb");
        $(this).removeClass("blstg").removeAttr("nb");
        if (gtls) {
            $(this).html(gm(gtls));
        } else {
            $(this).html("-");
        }
    });
});

With requests this script doesn't run, so you are left with the placeholder text. To answer your question about how to get this with bs4: you can write your own custom function(s) to reproduce the logic of the script. Additionally, the class used to mark the target elements for conversion is dynamic, so it needs to be picked up from the page as well.

N.B. I have swapped the separator "," for "." to avoid pandas stripping it out, but there is most likely a setting for this you can look up to preserve it.
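
For reference, pandas.read_html does expose such settings: its decimal and thousands arguments control how numbers are parsed. If you skip the .replace(',', '.') in the loop, the final line of the code below could instead be:

# Parse Vietnamese-style numbers directly: comma as decimal mark, dot as thousands separator
df = pd.read_html(str(soup.select_one('.table-radius')), decimal=',', thousands='.')[0]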


import requests
from bs4 import BeautifulSoup as bs
import re
import pandas as pd

def gm(r):
    r = re.sub(r'A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z', '', r)  # strip the uppercase padding letters
    n = []
    t = 0
    while t < len(r) - 1:
        n.append(int(r[t:t+2], 16))  # each remaining hex pair is one character code
        t += 2
    return ''.join(map(chr, n))


url = "https://webgia.com/lai-suat/"
req = requests.get(url, headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(req.text, "lxml")
dynamic_class = re.search(r'\$\("td\.([a-z]+)"', req.text).group(1)  # e.g. "blstg" in the script above

for i in soup.select(f'td.{dynamic_class}'):
    replacement = i['nb']
    del i['class']  # not actually needed as I replace innerText
    del i['nb'] # not actually needed as I replace innerText
    if replacement:
        i.string.replace_with(bs(gm(replacement), 'lxml').text.replace(',', '.'))  # to prevent pandas removing separator
    else:
        i.replace_with('-')

df = pd.read_html(str(soup.select_one('.table-radius')))[0]
print(df)
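
As a quick sanity check, the ported gm() decodes the first obfuscated nb value from the question into the same 0,20 that the Selenium-rendered page showed:

nb = "E3c7370616e20636c617C37B33d2B2746578742d6772H65I656e223e3A02c32303c2f7370616e3Ie"
print(gm(nb))  # <span class="text-green">0,20</span>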