How to collect data generated from a form using Python requests?

I'm doing some web scraping with form data, and I've run into a situation that I can't handle.

I need to get a table that is generated from a form with some options, as shown in the image below:

The website is this: https://aplicacoes.mds.gov.br/cadsuas/pesquisarConsultaExterna.html

For this, I tried to develop a small script, according to the code below:

import pandas as pd
import requests

tipo = 'Rede Socioassistencial'
uf = 'PR'
municipio = 'Campo Largo'

url = 'https://aplicacoes.mds.gov.br/cadsuas/pesquisarConsultaExterna.html'
payload = {
    'consultaExternaHelper.tipoBusca': '%s' % tipo,
    'consultaExternaHelper.endereco.municipio.uf.sigla': '%s' % uf,
    'consultaExternaHelper.endereco.municipio.id': '%s' % municipio,
}

response = requests.post(url, params=payload)
df = pd.read_html(response.text)

However, I have no experience with this type of application and, therefore, the result obtained is far from what was expected, as can be seen:

[                                                   0          1          2  \
0                                         Bem vindo!        NaN        NaN   
1  O CadSUAS é o sistema de cadastro do SUAS, que...        NaN        NaN   
2                                          PESQUISAR  PESQUISAR  PESQUISAR   

           3          4          5  
0        NaN        NaN        NaN  
1        NaN        NaN        NaN  
2  PESQUISAR  PESQUISAR  PESQUISAR  ,                                                    0  \
0  Tipo de Busca: Rede Socioassistencial Órgãos G...   
1                                              * UF:   
2                                                CPF   
3                                              Tipo:   
4                                                NaN   

                                                   1  \
0  Tipo de Busca: Rede Socioassistencial Órgãos G...   
1  Selecionar  AC  AL  AM  AP  BA  CE  DF  ES  GO...   
2                                              Nome:   
3                            Selecionar  CRAS  CREAS   
4                                                NaN   

                                                   2  \
0  Tipo de Busca: Rede Socioassistencial Órgãos G...   
1                                         Município:   
2                                                NaN   
3                                       Possui CEAS:   
4                                                NaN   

                                                   3   4   5   6   7  
0  Tipo de Busca: Rede Socioassistencial Órgãos G... NaN NaN NaN NaN  
1  Selecionar  ABATIA  ADRIANOPOLIS  AGUDOS DO SU... NaN NaN NaN NaN  
2                                                NaN NaN NaN NaN NaN  
3                          Todas  Com CEAS  Sem CEAS NaN NaN NaN NaN  
4                                                NaN NaN NaN NaN NaN  ,                                                    0
0  ACESSAR AREA RESTRITA - Sr. Gestor, clique aqu...
1  Versão 3.14.4 © 2008 Ministério do Desenvolvim...]

As I said, I'm still learning, so I am probably missing some detail or using an option that is not the most suitable.

Thanks in advance to anyone who can suggest an alternative approach.

User response:

To extract the table, use the code below. What I fixed:

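Pass the payload to requests.post as form data (data=payload), not as URL parameters (params=payload).
Extract only the table with id entidadeList (table#entidadeList) from the HTML; I used Beautiful Soup for this.

The payload also uses the values the form itself submits, such as "ent" and "963", instead of the labels shown on the page.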
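
The difference behind the first point can be checked without contacting the server. The following is a minimal sketch, not part of the original answer: it builds two POST requests with requests.Request(...).prepare() and prints where the payload ends up in each. Only one field of the form is used to keep the output short.

```python
import requests

url = "https://aplicacoes.mds.gov.br/cadsuas/pesquisarConsultaExterna.html"
payload = {"consultaExternaHelper.endereco.municipio.uf.sigla": "PR"}

# Build both requests without sending them, so nothing touches the network.
as_params = requests.Request("POST", url, params=payload).prepare()
as_data = requests.Request("POST", url, data=payload).prepare()

# params=: the payload is appended to the URL and the POST body stays empty.
print(as_params.url)
print(as_params.body)

# data=: the URL is unchanged and the payload is URL-encoded into the body.
print(as_data.url)
print(as_data.body)
print(as_data.headers["Content-Type"])
```

In the question's script the form fields therefore travelled in the query string with an empty POST body, which is consistent with the site answering with its search page rather than the results table. The full corrected script is: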

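from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://aplicacoes.mds.gov.br/cadsuas/pesquisarConsultaExterna.html'

# Field names and values as the search form submits them
# ("963" is the id used for Campo Largo, not its display name).
payload = {
    "consultaExternaHelper.tipoBusca": "ent",
    "consultaExternaHelper.endereco.municipio.uf.sigla": "PR",
    "consultaExternaHelper.endereco.municipio.id": "963",
    "consultaExternaHelper.cpfcnpj": "",
    "consultaExternaHelper.nomeEntidade": "",
    "consultaExternaHelper.tipoEntidade.id": "05",
    "consultaExternaHelper.possuiCeas": "0"
}

# data= sends the payload in the POST body as form data.
response = requests.post(url, data=payload)
response.raise_for_status()

# Keep only the results table, identified by its id.
soup = BeautifulSoup(response.content, "html.parser")
table = soup.find("table", id="entidadeList")
if table is None:
    raise SystemExit("Table 'entidadeList' not found in the response")

# Recent pandas versions warn when read_html receives a literal HTML
# string, so wrap it in a file-like object.
df = pd.read_html(StringIO(str(table)))[0]

print(df)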

Outputs:

   Cnpj                              Nome  Nº Identificador  UF    Município
0   NaN   CRAS FERRARIA - LINDAMIR TORRES       41042001534  PR  CAMPO LARGO
1   NaN               CRAS JARDIM MELIANE       41042003954  PR  CAMPO LARGO
2   NaN     CRAS RIVABEM - LOLA ANDREASSA       41042035547  PR  CAMPO LARGO
3   NaN  CRAS POPULAR NOVA - DURVAL WEBER       41042039531  PR  CAMPO LARGO
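
The payload above contains the id "963" where the question used the name 'Campo Largo'. One way to find such ids is to read the option elements of the form's select field. The sketch below assumes the municipality field is an ordinary select element whose options carry the id in their value attribute. The helper name option_values is hypothetical, and the HTML fragment is made up for illustration: only the field name and the pair CAMPO LARGO / 963 come from this thread, the other option values are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the real page; the values for ABATIA and
# ADRIANOPOLIS are placeholders, not the site's real ids.
SAMPLE_HTML = """
<select name="consultaExternaHelper.endereco.municipio.id">
  <option value="">Selecionar</option>
  <option value="1">ABATIA</option>
  <option value="2">ADRIANOPOLIS</option>
  <option value="963">CAMPO LARGO</option>
</select>
"""


def option_values(html, select_name):
    """Map each option's visible label to the value the form would submit."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs= is needed here because find()'s own "name" argument means the tag name.
    select = soup.find("select", attrs={"name": select_name})
    if select is None:
        return {}
    return {
        option.get_text(strip=True): option.get("value", "")
        for option in select.find_all("option")
    }


municipios = option_values(
    SAMPLE_HTML, "consultaExternaHelper.endereco.municipio.id"
)
print(municipios["CAMPO LARGO"])
```

Against the real site, the same helper could be applied to response.text, provided the options are present in the HTML the server returns and not filled in later by JavaScript.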