Why can't Selenium find the table in this website?

I'm trying to scrape a table that appears on a new page after you fill out and submit a form, but pd.read_html raises ValueError: No tables found. The code is below:

ANS_TabNetMLR_URL = navegador.get("http://www.ans.gov.br/anstabnet/cgi-bin/dh?dados/tabnet_rc.def")
selecionarLinha = Select(navegador.find_element(By.XPATH, '//*[@id="L"]'))
selecionarLinhaOperadora = selecionarLinha.select_by_visible_text('Operadora')

selecionarColuna = Select(navegador.find_element(By.XPATH, '//*[@id="C"]'))
selecionarColunaModalidade = selecionarColuna.select_by_visible_text('Grupo Modalidade')

selecionarConteudo = Select(navegador.find_element(By.XPATH, '//*[@id="I"]'))
selecionarConteudo.deselect_by_visible_text('Receita de contraprestações')
selecionarConteudoMLR = selecionarConteudo.select_by_visible_text('Despesa assistencial')

selecionarMostra = navegador.find_element(By.XPATH, '//*[@id="geral"]/thead/tr[2]/td[2]/center/form/table[2]/tbody/tr[4]/td/p[2]/input[1]').click()

novaURL = 'http://www.ans.gov.br/anstabnet/cgi-bin/tabnet?dados/tabnet_rc.def'
aguardar = WebDriverWait(navegador, 10).until(ec.url_to_be(novaURL))

encontrarTabela = navegador.find_element(By.XPATH, '//*[@id="geral"]/thead/tr[2]/td[2]/center/table/tbody')

HTML_tabela_MLR = encontrarTabela.get_attribute('outerHTML')
sopa = BeautifulSoup(HTML_tabela_MLR, 'html.parser')

tabela = sopa.find(name = 'table border')
df_completo_MLR = pd.DataFrame()

lista_completa_MLR=pd.read_html(str(tabela),index_col=('Operadora'), header=(0), thousands='.')

This is the console output:

Traceback (most recent call last):

  File "C:\Users\vitor.dias\Documents\ANSMLR.py", line 206, in <module>
    lista_completa_MLR=pd.read_html(str(tabela),index_col=('Operadora'), header=(0), thousands='.')

  File "C:\Users\vitor.dias\Anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)

  File "C:\Users\vitor.dias\Anaconda3\lib\site-packages\pandas\io\html.py", line 1113, in read_html
    return _parse(

  File "C:\Users\vitor.dias\Anaconda3\lib\site-packages\pandas\io\html.py", line 939, in _parse
    raise retained

  File "C:\Users\vitor.dias\Anaconda3\lib\site-packages\pandas\io\html.py", line 919, in _parse
    tables = p.parse_tables()

  File "C:\Users\vitor.dias\Anaconda3\lib\site-packages\pandas\io\html.py", line 239, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)

  File "C:\Users\vitor.dias\Anaconda3\lib\site-packages\pandas\io\html.py", line 569, in _parse_tables
    raise ValueError("No tables found")

ValueError: No tables found

Thanks in advance for any help, and sorry if this is a dumb question; I'm just getting started on Stack Overflow and with Selenium.

I tried to scrape the table from its HTML and turn it into a pandas DataFrame, but the table could not be found, even though I can see it when I go through the same steps manually.

CodePudding user response:

Suggested solution: I don't think you need bs4 at all here. Try:

tableEl = navegador.find_element(By.XPATH, '//center/table')
tableHtml = tableEl.get_attribute('outerHTML')

lista_completa_MLR=pd.read_html(tableHtml) # ,index_col=('Operadora'), header=(0), thousands='.')
# [I recommend trying without extra arguments first; if it works, try again with all arguments.]

Explanation: There are a few things I feel the need to point out about your code:

- There is no need to use BeautifulSoup on the outerHTML only to stringify the soup again [it's redundant].
- table border is not a tag name [I don't think tag names can have spaces in them]. table is the tag name and border is an attribute.
- encontrarTabela is the tbody element, so its outerHTML contains no table tag [tbody is inside the table, not the other way around]. So str(tabela) will just be "None" for two separate reasons.
- For pd.read_html to work, there needs to be at least one table tag in the input.
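
The points about the tag name and the tbody element can be checked without a browser. Here is a minimal sketch, assuming two hypothetical HTML strings (TABLE_HTML and TBODY_HTML) that stand in for the page's markup; they are not copied from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the page's markup (not taken from the real site).
TABLE_HTML = '<table border="1"><tbody><tr><td>Operadora</td></tr></tbody></table>'
TBODY_HTML = '<tbody><tr><td>Operadora</td></tr></tbody>'

full = BeautifulSoup(TABLE_HTML, "html.parser")
inner = BeautifulSoup(TBODY_HTML, "html.parser")

# "table border" is a tag name plus an attribute, so no tag is called that.
by_bad_name = full.find(name="table border")

# The outerHTML of a tbody has no table inside it, so this finds nothing either.
by_tag_in_tbody = inner.find(name="table")

# Searching by tag name, optionally requiring the border attribute, does work.
by_tag_and_attr = full.find("table", attrs={"border": True})

print(by_bad_name, by_tag_in_tbody)  # both lookups return None
print(repr(str(by_bad_name)))        # the string 'None', which is what read_html received
print(by_tag_and_attr.name)          # table
```

Either failed lookup on its own is enough to make tabela equal to None in the question's code.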

Regarding the title question, "Why can't Selenium find the table in this website?":

Given that the code ran without errors up to the last line of the snippet, I'd say that Selenium did find the table. It was pandas that couldn't find a table, because there were no table tags in the input that was passed to read_html.
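
To see the pandas side in isolation, here is a small sketch. The helper try_read_html and the sample TABLE_HTML are made up for this illustration. It wraps the string in StringIO, which newer pandas versions prefer over a raw HTML string, and passes flavor="lxml" only so the example depends on a single parser.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample markup; the real page's table will look different.
TABLE_HTML = (
    "<table border='1'>"
    "<thead><tr><th>Operadora</th><th>Total</th></tr></thead>"
    "<tbody><tr><td>A</td><td>10</td></tr></tbody>"
    "</table>"
)


def try_read_html(html):
    """Return the DataFrames parsed from html, or None if pandas finds no table."""
    try:
        return pd.read_html(StringIO(html), flavor="lxml")
    except ValueError:
        # pandas raises ValueError when the input contains no <table> element.
        return None


# str(None) is "None", which is what the question's code passed to read_html.
print(try_read_html(str(None)))

frames = try_read_html(TABLE_HTML)
print(len(frames), list(frames[0].columns))
```

The first call reproduces the failure from the question, and the second shows that the same function succeeds once the input actually contains a table tag.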