How do I extract a dataframe from a website


The website (poder360.com.br/banco-de-dados) has a lot of filters that generate a dataframe based on what you select in those filters. I'm trying to extract this dataframe in Python, but I can't figure out what to put in the request to achieve this.

This question is related to a previous question I asked:

How to find the correct API of a website?

CodePudding user response:

I just tested with some random filter values and checked the network requests in Firefox; when you click on Pesquisar, this request is sent:

https://pesquisas.poder360.com.br/web/consulta/fetch?unidades_federativas_id=15&regioes_id=2&cargos_id=2&institutos_id=3&data_pesquisa_de=2021-09-22&data_pesquisa_ate=2021-09-23&turno=T&tipo_id=T&candidatos_id=1&order_column=ano&order_type=asc

Of course it didn't return any data for my random values, and it's not an easy task because you have to know all the IDs for each parameter.
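
For reference, here is a minimal sketch of sending that same request from Python with requests. The parameter values are just the ones from the example URL above, not meaningful choices, and it assumes the endpoint accepts a plain GET and returns JSON:

import requests

# Endpoint observed in the browser's network tab when clicking "Pesquisar"
url = "https://pesquisas.poder360.com.br/web/consulta/fetch"

# Example values copied from the request above -- you still have to look up
# the real IDs for the filters you actually want
params = {
    "unidades_federativas_id": 15,
    "regioes_id": 2,
    "cargos_id": 2,
    "institutos_id": 3,
    "data_pesquisa_de": "2021-09-22",
    "data_pesquisa_ate": "2021-09-23",
    "turno": "T",
    "tipo_id": "T",
    "candidatos_id": 1,
    "order_column": "ano",
    "order_type": "asc",
}

response = requests.get(url, params=params)
response.raise_for_status()
print(response.json())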

Also, you could contact them to see if they can provide an API for easier access.

CodePudding user response:

What you can do is call the URL behind each of the filter fields on the page; for example, the URL that returns the region options is https://pesquisas.poder360.com.br/web/regiao/enum?q=

This returns a JSON response:

{"current_page":1,"data":[{"id":1,"descricao":"Regi\u00e3o Centro-Oeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Centro-Oeste"},{"id":2,"descricao":"Regi\u00e3o Nordeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Nordeste"},{"id":3,"descricao":"Regi\u00e3o Norte","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Norte"},{"id":4,"descricao":"Regi\u00e3o Sudeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Sudeste"},{"id":5,"descricao":"Regi\u00e3o Sul","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Sul"},{"id":6,"descricao":"Nacional","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Nacional"},{"id":7,"descricao":"Regional","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regional"},{"id":8,"descricao":"Regi\u00e3o Norte_Centro-Oeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Norte_Centro-Oeste"},{"id":9,"descricao":"Municipal","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Municipal"}],"first_page_url":"https:\/\/pesquisas.poder360.com.br\/web\/regiao\/enum?page=1","from":1,"last_page":1,"last_page_url":"https:\/\/pesquisas.poder360.com.br\/web\/regiao\/enum?page=1","next_page_url":"null","path":"https:\/\/pesquisas.poder360.com.br\/web\/regiao\/enum","per_page":15,"prev_page_url":"null","to":9,"total":9}

Note that I have replaced all the null values with the string "null" so that pasting the response into Python does not trigger an error.
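
As a side note, if you fetch the endpoint directly in Python instead of pasting the response, that replacement isn't needed: requests (and the standard json module) decodes JSON null to Python's None. A minimal sketch, assuming the endpoint is reachable with a plain GET:

import requests

# requests decodes the JSON for you and maps null to Python's None
res = requests.get("https://pesquisas.poder360.com.br/web/regiao/enum", params={"q": ""}).json()
print(res["total"])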

You can then extract the data using the following code:

res = {"current_page":1,"data":[{"id":1,"descricao":"Regi\u00e3o Centro-Oeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Centro-Oeste"},{"id":2,"descricao":"Regi\u00e3o Nordeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Nordeste"},{"id":3,"descricao":"Regi\u00e3o Norte","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Norte"},{"id":4,"descricao":"Regi\u00e3o Sudeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Sudeste"},{"id":5,"descricao":"Regi\u00e3o Sul","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Sul"},{"id":6,"descricao":"Nacional","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Nacional"},{"id":7,"descricao":"Regional","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regional"},{"id":8,"descricao":"Regi\u00e3o Norte_Centro-Oeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Norte_Centro-Oeste"},{"id":9,"descricao":"Municipal","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Municipal"}],"first_page_url":"https:\/\/pesquisas.poder360.com.br\/web\/regiao\/enum?page=1","from":1,"last_page":1,"last_page_url":"https:\/\/pesquisas.poder360.com.br\/web\/regiao\/enum?page=1","next_page_url":"null","path":"https:\/\/pesquisas.poder360.com.br\/web\/regiao\/enum","per_page":15,"prev_page_url":"null","to":9,"total":9}
for item in res['data']:
    print(item['descricao'])

This prints the following output:

Região Centro-Oeste
Região Nordeste
Região Norte
Região Sudeste
Região Sul
Nacional
Regional
Região Norte_Centro-Oeste
Municipal

Now you just need to put all the relevant URLs in a list and run the same script over each of them, as sketched below.
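
A minimal sketch of that loop, fetching each enum endpoint with requests and turning the options into a small pandas lookup table. Only the one URL shown above is listed; the paths of the other filter endpoints are not known here and have to be taken from the network tab:

import requests
import pandas as pd

# Only the enum URL shown above is listed here; add the other filter
# endpoints once you have found them in the network tab
enum_urls = [
    "https://pesquisas.poder360.com.br/web/regiao/enum?q=",
]

for url in enum_urls:
    res = requests.get(url).json()
    # Build a small lookup table of id -> description for this filter
    df = pd.DataFrame(res["data"])[["id", "descricao"]]
    print(url)
    print(df)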
