I'm doing some web scraping with form data, and I've run into a situation that I can't handle.
I need to get a table that is generated by submitting a form with some options.
The website is this: https://aplicacoes.mds.gov.br/cadsuas/pesquisarConsultaExterna.html
For this, I wrote a small script:
import pandas as pd
import requests
tipo = 'Rede Socioassistencial'
uf = 'PR'
municipio = 'Campo Largo'
url = 'https://aplicacoes.mds.gov.br/cadsuas/pesquisarConsultaExterna.html'
payload = {
    'consultaExternaHelper.tipoBusca': '%s' % tipo,
    'consultaExternaHelper.endereco.municipio.uf.sigla': '%s' % uf,
    'consultaExternaHelper.endereco.municipio.id': '%s' % municipio,
}
response = requests.post(url, params=payload)
df = pd.read_html(response.text)
However, I have no experience with this type of application and, therefore, the result obtained is far from what was expected, as can be seen:
[ 0 1 2 \
0 Bem vindo! NaN NaN
1 O CadSUAS é o sistema de cadastro do SUAS, que... NaN NaN
2 PESQUISAR PESQUISAR PESQUISAR
3 4 5
0 NaN NaN NaN
1 NaN NaN NaN
2 PESQUISAR PESQUISAR PESQUISAR , 0 \
0 Tipo de Busca: Rede Socioassistencial Órgãos G...
1 * UF:
2 CPF
3 Tipo:
4 NaN
1 \
0 Tipo de Busca: Rede Socioassistencial Órgãos G...
1 Selecionar AC AL AM AP BA CE DF ES GO...
2 Nome:
3 Selecionar CRAS CREAS
4 NaN
2 \
0 Tipo de Busca: Rede Socioassistencial Órgãos G...
1 Município:
2 NaN
3 Possui CEAS:
4 NaN
3 4 5 6 7
0 Tipo de Busca: Rede Socioassistencial Órgãos G... NaN NaN NaN NaN
1 Selecionar ABATIA ADRIANOPOLIS AGUDOS DO SU... NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 Todas Com CEAS Sem CEAS NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN , 0
0 ACESSAR AREA RESTRITA - Sr. Gestor, clique aqu...
1 Versão 3.14.4 © 2008 Ministério do Desenvolvim...]
As I reported, I'm still practicing, so I must certainly be forgetting some detail or using an option that is not the most suitable.
Thanks in advance to anyone who can suggest a solution.
CodePudding user response:
To extract the table, use the code below. What I fixed:
- Pass the payload to requests.post as form data (data=), not as URL parameters (params=).
- Extract only the table#entidadeList element from the HTML (I used Beautiful Soup for this).
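The first fix can be checked without contacting the server. The following is a minimal sketch, assuming only that the requests library is installed; it builds the same POST twice with requests.Request(...).prepare() and prints where the payload ends up. The one-field payload is a shortened stand-in for the full form, used for illustration only.

```python
import requests

url = "https://aplicacoes.mds.gov.br/cadsuas/pesquisarConsultaExterna.html"
# Shortened stand-in payload (illustration only, not the full form)
payload = {"consultaExternaHelper.endereco.municipio.uf.sigla": "PR"}

# prepare() builds the request without sending it, so nothing hits the network
as_params = requests.Request("POST", url, params=payload).prepare()
as_data = requests.Request("POST", url, data=payload).prepare()

print(as_params.url)   # payload appended to the URL as a query string
print(as_params.body)  # None: the POST body is empty
print(as_data.url)     # URL unchanged
print(as_data.body)    # payload URL-encoded in the request body
print(as_data.headers["Content-Type"])
```

With params= the form fields travel in the query string and the body stays empty, which would explain why the server returned the blank search page instead of results.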
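from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://aplicacoes.mds.gov.br/cadsuas/pesquisarConsultaExterna.html'
# Field values are the codes the form submits, not the labels shown on the page
payload = {
    "consultaExternaHelper.tipoBusca": "ent",
    "consultaExternaHelper.endereco.municipio.uf.sigla": "PR",
    "consultaExternaHelper.endereco.municipio.id": "963",
    "consultaExternaHelper.cpfcnpj": "",
    "consultaExternaHelper.nomeEntidade": "",
    "consultaExternaHelper.tipoEntidade.id": "05",
    "consultaExternaHelper.possuiCeas": "0"
}
# data= sends the payload as form data in the POST body
response = requests.post(url, data=payload)
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")
# Keep only the results table and ignore the layout tables on the page
table = soup.find("table", id="entidadeList")
if table is None:
    raise ValueError("table#entidadeList not found in the response")
# Recent pandas versions deprecate passing literal HTML, so wrap it in StringIO
df = pd.read_html(StringIO(str(table)))[0]
print(df)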
Outputs:
   Cnpj                              Nome  Nº Identificador  UF    Município
0   NaN   CRAS FERRARIA - LINDAMIR TORRES       41042001534  PR  CAMPO LARGO
1   NaN               CRAS JARDIM MELIANE       41042003954  PR  CAMPO LARGO
2   NaN     CRAS RIVABEM - LOLA ANDREASSA       41042035547  PR  CAMPO LARGO
3   NaN  CRAS POPULAR NOVA - DURVAL WEBER       41042039531  PR  CAMPO LARGO
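Note that the payload uses codes such as "963" rather than the visible label "Campo Largo". One way to find such codes is to read the option elements of the form's select fields. The sketch below is an assumption-laden illustration: sample_html is a hypothetical fragment, not the site's real markup, and option_values is a helper name made up for this example. It also assumes the select's name attribute matches the payload key.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real form markup (assumption)
sample_html = """
<select name="consultaExternaHelper.endereco.municipio.id">
  <option value="">Selecionar</option>
  <option value="963">CAMPO LARGO</option>
</select>
"""


def option_values(html, select_name):
    """Map each visible option label of a <select> to its value attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs= is needed because find()'s first positional argument is the tag name
    select = soup.find("select", attrs={"name": select_name})
    if select is None:
        return {}
    return {
        opt.get_text(strip=True): opt.get("value", "")
        for opt in select.find_all("option")
    }


lookup = option_values(sample_html, "consultaExternaHelper.endereco.municipio.id")
print(lookup["CAMPO LARGO"])
```

On the real page the same helper could be applied to response.text, provided the select fields are present in the returned HTML; that has not been verified here.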