Python Web Scraping - HTML error returning incomplete

Time:12-08

When I run my code, the HTML comes back with data missing. What could be causing it?
Everything worked fine until I changed the code to use Selenium expected conditions.

The code is incomplete because the full version was not accepted here, but I think it shows what is happening.

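from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# options is defined in the part of the script omitted here
navegador = webdriver.Firefox(options=options)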

wait = WebDriverWait(navegador, 30)

link = '******'
navegador.get(url = link)

wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_txtLogin"))).send_keys('******')
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_txtSenha"))).send_keys('******')
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_btnEnviar"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_TreeView2t8"))).click()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[title='07 de dezembro']"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa"))).click()
wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa"]/option[2]'))).click()
teste = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="divScroll"]'))).get_attribute('innerHTML')

soup = BeautifulSoup(teste, "html.parser")

I get the following back.

<table align="center" style="border-right: #66cc00 1px solid; border-top: #66cc00 1px solid; border-left: #66cc00 1px solid; border-bottom: #66cc00 1px solid" width="100%">
<tbody><tr>
<td>
<table>
<tbody><tr>
<td >
<span id="ctl00_ctl00_Content_Content_Label1" style="font-size:12px;">Terminal - Empresa - Exportador:</span>
</td>
<td>
<select  id="ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa" name="ctl00$ctl00$Content$Content$ddlVagasTerminalEmpresa" onchange="javascript:setTimeout('__doPostBack(\'ctl00$ctl00$Content$Content$ddlVagasTerminalEmpresa\',\'\')', 0)" style="width: 475px;">
<option selected="selected" value="0">Selecione um Terminal.</option>
<option value="68623">TEAG - CARGILL - 04 CARGILL AGRICOLA S A  -  GUARUJA - SP</option>
<option value="68594">TEG  - CARGILL - 04 CARGILL AGRICOLA S A  -  GUARUJA - SP</option>
</select>
</td>
</tr>
</tbody></table>
</td>
</tr>
<tr>
<td >
<span id="ctl00_ctl00_Content_Content_lbl_titulo_principal" style="font-size:12px;">Disponibilização de vagas do dia: 07/12/2022</span></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td valign="top">
</td>
</tr>
<tr>

This is what I should get back:

        </tr>
    <tr>
        <td></td>
    </tr>
    <tr>
        <td valign="top">
            <div id="ctl00_ctl00_Content_Content_pn_turno_1" style="width:100%;">
    
            <table width="100%" style="border-right: #66cc00 1px solid; border-top: #66cc00 1px solid; border-left: #66cc00 1px solid; border-bottom: #66cc00 1px solid">
                <tbody><tr>
                    <td >
                        <span id="ctl00_ctl00_Content_Content_lbl_turno_1">Turno 01 - intervalo: 7/12/2022 0:00:00 as 7/12/2022 1:00:00</span></td>
                </tr>
                <tr>
                    <td style="height:200px;width: 100%;" valign="top">
                        <table border="0"  cellpadding="4" cellspacing="2" style="font-size:14;width: 100%;z-index: -1;">
                                                                   
                                    </table>                                                                    
                                    <table border="0"  cellpadding="3" cellspacing="2" style="font-size:14;width: 100%">
                                
                                    <tbody><tr >                                
                                        <td width="12%" align="center">
                                            <span id="ctl00_ctl00_Content_Content_rpt_turno_1_ctl01_lblEmpresaTerminal_1" title="TEAG - CARGILL - 04 CARGILL AGRICOLA S A  -  GUARUJA - SP" style="font-size:7px;">CARGILL - TEAG</span>
                                            <input type="image" name="ctl00$ctl00$Content$Content$rpt_turno_1$ctl01$imb_vaga_1" id="ctl00_ctl00_Content_Content_rpt_turno_1_ctl01_imb_vaga_1" title="Vaga agendada." src="../App_Themes/SisLog/Images/caminhao.png" onclick="javascript:window.open('Cadastro.aspx?id_agenda=7054462&amp;id_turno=7/12/2022 0:00:00;7/12/2022 1:00:00&amp;data=07/12/2022&amp;id_turno_exportador=198574&amp;id_turno_agenda=61348&amp;id_transportadora=23213&amp;id_turno_transp=68623&amp;id_Cliente=7708&amp;codigo_terminal=7708&amp;codigo_empresa=1&amp;codigo_exportador=24978&amp;codigo_transportador=23213&amp;codigo_turno=1&amp;turno_transp_vg=68623','_blank','height=850,width=1000,top=(screen.width)?(screen.width-1000)/2 : 0,left=(screen.height)?(screen.height-700)/2 : 0,toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=yes,resizable=no');" style="height:20px;border-width:0px;">                                                
                                        </td>

CodePudding user response:

Since you did not share a link to the page you are working on, we can only guess at the cause.
My guess is that you are extracting the HTML from an element that is not yet fully rendered.
To fix this, try changing presence_of_element_located to visibility_of_element_located in the line that reads divScroll, so that it becomes:

teste = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="divScroll"]'))).get_attribute('innerHTML')

If that is not enough, try adding a short delay before extracting the HTML (this needs import time at the top of the script), as follows:

wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="divScroll"]')))
time.sleep(2)
teste = navegador.find_element(By.XPATH, '//*[@id="divScroll"]').get_attribute('innerHTML')
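
A fixed delay is either longer than needed or too short when the page is slow. One alternative, sketched here under the assumption that the content of divScroll simply stops changing once the page has finished updating, is to poll the HTML until consecutive reads are identical. The helper wait_until_stable and its parameters are hypothetical names for illustration, not part of Selenium.

```python
import time


def wait_until_stable(read_value, timeout=30.0, poll=0.5, settle=2):
    """Poll read_value() until a non-empty value has been repeated
    `settle` times in a row; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    last = None
    same_count = 0
    while time.monotonic() < deadline:
        value = read_value()
        if value and value == last:
            # Same non-empty content as the previous read.
            same_count += 1
            if same_count >= settle:
                return value
        else:
            # Content is empty or still changing: start counting again.
            same_count = 0
        last = value
        time.sleep(poll)
    raise TimeoutError("value did not stabilise within %.1f s" % timeout)


# With the driver from the question, this might be used as:
# teste = wait_until_stable(
#     lambda: navegador.find_element(By.ID, "divScroll").get_attribute("innerHTML"))
```

The element is looked up again on every read, so a div that is replaced during the postback is picked up on the next poll.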

If the element is never visible, so that visibility_of_element_located cannot be used on it, keep presence_of_element_located and add the delay:

wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="divScroll"]')))
time.sleep(2)
teste = navegador.find_element(By.XPATH, '//*[@id="divScroll"]').get_attribute('innerHTML')
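
If you know of something that only appears in the complete HTML, you can also wait for that directly. WebDriverWait.until accepts any callable that takes the driver and returns a truthy value when the wait is over, so a custom condition can be passed in place of an EC helper. The sketch below assumes that the id ctl00_ctl00_Content_Content_pn_turno_1 from your expected output is always present once the table has loaded; inner_html_contains is a hypothetical helper name, not a Selenium function.

```python
def inner_html_contains(locator, marker):
    """Build a wait condition: return the element's innerHTML once it
    contains `marker`, otherwise False so that the wait keeps polling."""
    def condition(driver):
        html = driver.find_element(*locator).get_attribute("innerHTML")
        return html if html and marker in html else False
    return condition


# With the wait object from the question, this might be used as:
# teste = wait.until(inner_html_contains(
#     (By.ID, "divScroll"), "ctl00_ctl00_Content_Content_pn_turno_1"))
```

Because until returns whatever the condition returns, teste receives the complete HTML as soon as the marker shows up, or the wait raises TimeoutException after the 30 seconds configured on wait.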