HtmlUnit returning empty list of DomElements

Time:12-16

I am having trouble retrieving the list of DOM elements when calling getElementsByName on an HtmlPage.

Here is the HTML page. I am trying to get the CategoriaAgente select element.

HTML (The part that I need):

<select name="CategoriaAgente">
  <option value="-">Escolha uma categoria</option>
  <option value="t">Todos</option>
  <option value="p">Permissionária de Distribuição</option>
  <option value="d">Concessionária de Distribuição</option>
</select>

Snippet of the Java code (Using HtmlUnit):

    public List<HtmlOption> listaAgentes() {
        List<HtmlOption> listaAgentes = null;

        try (WebClient webClient = new WebClient()) {
            log.info("COLETANDO AGENTES");

            // WebClient parameters
            webClient.setJavaScriptTimeout(15000);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setUseInsecureSSL(true);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setTimeout(300000);

            String url = "https://www2.aneel.gov.br/aplicacoes_liferay/tarifa/";
            HtmlPage page = webClient.getPage(url);

            // Select the agent category
            List<DomElement> listaCategoriaAgente = page.getElementsByName("CategoriaAgente");
            // ...

The list listaCategoriaAgente is ALWAYS empty. I tried several solutions found on Stack Overflow, but none of them worked. Any help? Thanks in advance!

EDIT: After the comment from @hooknc, I found that the page is protected by some kind of captcha from Cloudflare. This is what I get from Postman:

[Screenshot of the Postman response]

Does anyone know how to bypass this challenge form using HtmlUnit? Thanks!
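
Before trying to get past the challenge, it can help to confirm in code that the response really is the challenge page and not the form. The following is a minimal sketch in plain Java, assuming the markup has already been obtained as a string (for example from page.asXml()). The class name ChallengePageDetector and its method are hypothetical, not part of HtmlUnit, and the marker strings are only the ones quoted in this thread; Cloudflare may use different ones.

```java
/** Hypothetical helper (not part of HtmlUnit): spots the challenge page in a markup string. */
final class ChallengePageDetector {

    // Marker strings quoted in this thread. Assumption: they may change at any time.
    private static final String[] MARKERS = {
        "Checking if the site connection is secure",
        "challenge-form",
        "challenge-success"
    };

    private ChallengePageDetector() {
    }

    /** Returns true if the markup contains at least one known challenge marker. */
    static boolean looksLikeChallenge(String markup) {
        if (markup == null) {
            return false;
        }
        for (String marker : MARKERS) {
            if (markup.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String challenge = "<div id=\"challenge-success\" style=\"display: none;\">";
        String form = "<select name=\"CategoriaAgente\"></select>";
        System.out.println(looksLikeChallenge(challenge));
        System.out.println(looksLikeChallenge(form));
    }
}
```

If the check returns true, an empty result from getElementsByName is expected, because the select element is not in the challenge page at all.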

EDIT 2:

Well, I think I made some progress(?)...

This is the code so far....

    try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        webClient.getOptions().setCssEnabled(false);
        webClient.setJavaScriptTimeout(0);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(0);
        webClient.getCookieManager().setCookiesEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getCache().setMaxSize(0);
        webClient.waitForBackgroundJavaScript(10_000);
        webClient.waitForBackgroundJavaScriptStartingBefore(10_000);

        HtmlPage page = null;
        String url = null;

        url = "https://www2.aneel.gov.br/aplicacoes_liferay/tarifa/";
        page = webClient.getPage(url);

        if (page.asXml().contains("Checking if the site connection is secure")) {
            log.info(page.asXml());

            synchronized(page) {
                page.wait(10_000);
            }
            webClient.waitForBackgroundJavaScript(10_000);
        }

And... this is what I get from the log...

<div id="challenge-success" style="display: none;">
  <div>
    <span>
      <img alt="Success icon" src="[base64 image data omitted]"/>
    </span>
    Connection is secure
  </div>
  <div>
    Proceeding...
  </div>
</div>

So it says "Proceeding..." but nothing happens. I waited forever, but it just stays stuck on "Proceeding...".

Any thoughts?? Thanks!!!

CodePudding user response:

Here is what happened. I posted a related question, and someone (possibly from the HtmlUnit team) pushed an update to the Git repository to solve the cookie problem. With that updated version (2.68.0-SNAPSHOT; I also had to update apache-commons-lang3), all the problems disappeared. Cloudflare accepted the connection and everything worked! Here is the final version of the code:

    try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        String url = "https://www2.aneel.gov.br:443/aplicacoes_liferay/tarifa/";
        
        // WebClient parameters
        webClient.getOptions().setCssEnabled(true);
        webClient.setJavaScriptTimeout(0);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(0);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setRedirectEnabled(true);
        
        CookieManager cookies = new CookieManager();            
        cookies.setCookiesEnabled(true);
        webClient.setCookieManager(cookies);
        
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        
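        // Note: no page has been loaded at this point, so there are no background
        // JavaScript jobs to wait for and these two calls return immediately.
        webClient.waitForBackgroundJavaScript(10000);
        webClient.waitForBackgroundJavaScriptStartingBefore(10000);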
        
        webClient.getCache().setMaxSize(0);
        
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
        java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.OFF);
        java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
        
        HtmlPage page = webClient.getPage(url);
        webClient.getRefreshHandler().handleRefresh(page, new URL(url), 10);
        
        synchronized(page) {
            page.wait(10000);
        }
        
        if (page.asXml().contains("Checking if the site connection is secure")) {
            log.info(page.asXml());
            webClient.waitForBackgroundJavaScript(10_000);
        }

        List<DomElement> listaCategoriaAgente = page.getElementsByName("CategoriaAgente");

With the updated version and this code, the list of DOM elements I needed was returned correctly. Thank you all for the help!
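
The fixed 10-second waits in the code above could probably be replaced by a bounded polling loop that stops as soon as the expected content appears. Below is a minimal sketch in plain Java under that assumption. The class PollUntil and its poll method are hypothetical names, not part of HtmlUnit, and the timeout and interval values are left to the caller.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of HtmlUnit): polls a condition until it holds or time runs out. */
final class PollUntil {

    private PollUntil() {
    }

    /**
     * Evaluates the condition every intervalMillis until it returns true
     * or timeoutMillis has elapsed. Returns true only if the condition was met.
     */
    static boolean poll(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMillis <= 0) {
                return false;
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(intervalMillis, remainingMillis));
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With HtmlUnit, the condition might re-read the current page on every attempt, for example through webClient.getCurrentWindow().getEnclosedPage(), and check whether getElementsByName("CategoriaAgente") returns a non-empty list. Re-reading matters because the challenge may replace the page that getPage originally returned.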
