How to web scrape this page and turn it into a CSV file?

Time:11-26

My name is João. I'm a law student from Brazil and I'm new to this. I have been trying to scrape this page for a week, to help me with my undergraduate thesis and to help other researchers.

I want to make a CSV file with all the results of a search on a court's website, as shown in this image.

I'm working with Python on Google Colab and I've tried many ways to scrape the page, but none of them worked well. My most complete approach was an attempt to adapt a product-scraping tutorial: video and corresponding code on GitHub.

My adaptation does not work in Colab: it produces neither an error message nor a CSV file. By comparing the pages with the lesson, I identified some problems in the adapted code below:

  1. When extracting the result HTML from each of the 41 pages, I believe I should build a list of the result pages extracted, but my code extracted the text too and I'm not sure how to correct it.

  2. When I try to extract the data from the result HTML, I fail. Whenever I tried to create a list from it, it returned only one result.

  3. Beyond the tutorial, I would also like to extract data from the second table in the result HTML: the link to the oldest "relatório/voto" and its date, and the link to the oldest "acórdão" and its date. I'm not sure how and where in the code I should do that.

ADAPTED CODE

from requests_html import HTMLSession
import csv

s = HTMLSession()

# STEP 01: take the result html

def get_results_links(page):
  url = f"https://www.tce.sp.gov.br/jurisprudencia/pesquisar?txtTdPalvs=município pessoal 37&txtExp=temporari&txtQqUma=admissão contratação&txtNenhPalvs=&txtNumIni=&txtNumFim=&tipoBuscaTxt=Documento&_tipoBuscaTxt=on&quantTrechos=1&processo=&exercicio=&dataAutuacaoInicio=&dataAutuacaoFim=&dataPubInicio=01/01/2021&dataPubFim=31/12/2021&_relator=1&_auditor=1&_materia=1&tipoDocumento=2&_tipoDocumento=1&acao=Executa&offset={page}"
  links = []
  r = s.get(url)
  results = r.html.find('td.small a')
  for item in results:
    links.append(item.find('a', first=True).attrs['href'])  # Problem 01: I believe this should create a list of the result HTML links extracted from the page, but it extracted the text too.
  return links

# STEP 02: extracting the relevant information from each result HTML collected before

def parse_result(url):
    r = s.get(url)
    numero = r.html.find('td.small', first=True).text.strip()
    data_autuacao = r.html.find('td.small', first=True).text.strip()
    try:
      parte_1 = r.html.find('td.small', first=True).text.strip()
    except AttributeError as err:
      sku = 'Não há'
    try:
      parte_2 = r.html.find('td.small', first=True).text.strip()
    except AttributeError as err:
      parte_2 = 'Não há'
    materia = r.html.find('td.small', first=True).text.strip()
    exercicio = r.html.find('td.small', first=True).text.strip()
    objeto = r.html.find('td.small', first=True).text.strip()
    relator = r.html.find('td.small', first=True).text.strip()
    #Problem 02
# STEP 03: building a dict from the objects created before
    product = {
        'Nº do Processo': numero,
        "Link do Processo" : r,
        'Data de Autuação': data_autuacao,
        'Parte 1': parte_1,
        'Parte 2': parte_2,
        'Exercício': exercicio,
        'Matéria' : materia,
        'Objeto' : objeto,
        'Relator' : relator
        #'Relatório/Voto' :
        #'Data Relatório/Voto' :
        #'Acórdão' :
        #'Data Acórdão' :
    }#Problem 03
    return product

# STEP 04: saving as csv
def save_csv(final):
    keys = final [0].keys()

    with open('products.csv', 'w') as f:
        dict_writer = csv.DictWriter(f, keys)
        dict_writer.writeheader()
        dict_writer.writerows(final)

# STEP 05: main - joining the functions
def main():
    final = []
    for x in range(0, 410, 10):
        print('Getting Page ', x)
        urls = get_results_links(x)
        for url in urls:
            final.append(parse_result(url))
        print('Total: ', len(final))
        save_csv(final)

Thank you, @shelter, for your help so far. I tried to make the question more specific.

User response:

There are better (albeit more complex) ways of obtaining that information, such as Scrapy or an async solution. Nonetheless, here is one way of getting the information you're after and saving it into a CSV file. I only scraped the first 2 pages (20 results); you can increase the range if you wish:

from bs4 import BeautifulSoup as bs
import requests
from tqdm.notebook import tqdm
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)

big_list = []
detailed_list = []
for x in tqdm(range(0, 20, 10)):
    url = f'https://www.tce.sp.gov.br/jurisprudencia/pesquisar?txtTdPalvs=município pessoal 37&txtExp=temporari&txtQqUma=admissão contratação&txtNenhPalvs=&txtNumIni=&txtNumFim=&tipoBuscaTxt=Documento&_tipoBuscaTxt=on&quantTrechos=1&processo=&exercicio=&dataAutuacaoInicio=&dataAutuacaoFim=&dataPubInicio=01/01/2021&dataPubFim=31/12/2021&_relator=1&_auditor=1&_materia=1&tipoDocumento=2&_tipoDocumento=1&acao=Executa&offset={x}'
    r = s.get(url)
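    # NOTE: the attribute filter inside tr[...] is empty in this copy of the answer, which is not a valid selector.
    # Plain 'tr' is used here so the selector parses; it may match more links than intended.
    urls = bs(r.text, 'html.parser').select('tr td:nth-of-type(2) a')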
    big_list.extend(['https://www.tce.sp.gov.br/jurisprudencia/' + x.get('href') for x in urls])
for x in tqdm(big_list):
    r = s.get(x)
    soup = bs(r.text, 'html.parser')
    n_proceso = soup.select_one('td:-soup-contains("N° Processo:")').find_next('td').text if soup.select('td:-soup-contains("N° Processo:")') else None
    link_proceso = x
    autoacao = soup.select_one('td:-soup-contains("Autuação:")').find_next('td').text if soup.select('td:-soup-contains("Autuação:")') else None
    parte_1 = soup.select_one('td:-soup-contains("Parte 1:")').find_next('td').text if soup.select('td:-soup-contains("Parte 1:")') else None
    parte_2 = soup.select_one('td:-soup-contains("Parte 2:")').find_next('td').text if soup.select('td:-soup-contains("Parte 2:")') else None
    materia = soup.select_one('td:-soup-contains("Matéria:")').find_next('td').text if soup.select('td:-soup-contains("Matéria:")') else None
    exercicio = soup.select_one('td:-soup-contains("Exercício:")').find_next('td').text if soup.select('td:-soup-contains("Exercício:")') else None
    objeto = soup.select_one('td:-soup-contains("Objeto:")').find_next('td').text if soup.select('td:-soup-contains("Objeto:")') else None
    relator = soup.select_one('td:-soup-contains("Relator:")').find_next('td').text if soup.select('td:-soup-contains("Relator:")') else None
    relatorio_voto = soup.select_one('td:-soup-contains("Relatório / Voto ")').find_previous('a').get('href') if soup.select('td:-soup-contains("Relatório / Voto ")') else None
    data_relatorio = soup.select_one('td:-soup-contains("Relatório / Voto ")').find_previous('td').text if soup.select('td:-soup-contains("Relatório / Voto ")') else None
    acordao = soup.select_one('td:-soup-contains("Acórdão ")').find_previous('a').get('href') if soup.select('td:-soup-contains("Acórdão ")') else None
    data_acordao = soup.select_one('td:-soup-contains("Acórdão ")').find_previous('td').text if soup.select('td:-soup-contains("Acórdão ")') else None
    detailed_list.append((n_proceso, link_proceso, autoacao, parte_1, parte_2, 
                          materia, exercicio, objeto, relator, relatorio_voto, 
                          data_relatorio, acordao, data_acordao))
detailed_df = pd.DataFrame(detailed_list, columns = ['n_proceso', 'link_proceso', 'autoacao', 'parte_1', 
                                                     'parte_2', 'materia', 'exercicio', 'objeto', 'relator', 
                                                     'relatorio_voto', 'data_relatorio', 'acordao', 'data_acordao'])
display(detailed_df) 
detailed_df.to_csv('legal_br_stuffs.csv')

Result in terminal:

100%
2/2 [00:04<00:00, 1.78s/it]
100%
20/20 [00:07<00:00, 2.56it/s]
n_proceso   link_proceso    autoacao    parte_1 parte_2 materia exercicio   objeto  relator relatorio_voto  data_relatorio  acordao data_acordao
0   18955/989/20    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=18955/989/20&offset=0  31/07/2020  ELVES SCIARRETTA CARREIRA   PREFEITURA MUNICIPAL DE BRODOWSKI   RECURSO ORDINARIO   2020    Recurso Ordinário Protocolado em anexo. EDGARD CAMARGO RODRIGUES    https://www2.tce.sp.gov.br/arqs_juri/pdf/801385.pdf 20/01/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/801414.pdf 20/01/2021
1   13614/989/18    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=13614/989/18&offset=0  11/06/2018  PREFEITURA MUNICIPAL DE SERRA NEGRA     RECURSO ORDINARIO   2014    Recurso Ordinário   ANTONIO ROQUE CITADINI  https://www2.tce.sp.gov.br/arqs_juri/pdf/797986.pdf 05/02/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/800941.pdf 05/02/2021
2   6269/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=6269/989/19&offset=0   19/02/2019  PREFEITURA MUNICIPAL DE TREMEMBE        ADMISSAO DE PESSOAL - CONCURSO PROCESSO SELETIVO    2018    INTERESSADO: Rafael Varejão Munhos e outros. EDITAL Nº: 01/2017. CONCURSO PÚBLICO: 01/2017. None    https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
3   14011/989/19    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14011/989/19&offset=0  11/06/2019  RUBENS EDUARDO DE SOUZA AROUCA  PREFEITURA MUNICIPAL DE TREMEMBE    RECURSO ORDINARIO   2019    Recurso Ordinário   RENATO MARTINS COSTA    https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
4   14082/989/19    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14082/989/19&offset=0  12/06/2019  PREFEITURA MUNICIPAL DE TREMEMBE        RECURSO ORDINARIO   2019    Recurso Ordinário nos autos do TC n° 6269.989.19 - Admissão de pessoal - Concurso Público   RENATO MARTINS COSTA    https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
5   14238/989/19    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14238/989/19&offset=0  13/06/2019  MARCELO VAQUELI PREFEITURA MUNICIPAL DE TREMEMBE    RECURSO ORDINARIO   2019    Recurso Ordinário   RENATO MARTINS COSTA    https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
6   14141/989/20    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14141/989/20&offset=0  28/05/2020  PREFEITURA MUNICIPAL DE BIRIGUI CRISTIANO SALMEIRAO RECURSO ORDINARIO   2018    Recurso Ordinário   RENATO MARTINS COSTA    https://www2.tce.sp.gov.br/arqs_juri/pdf/804259.pdf 06/02/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/804262.pdf 06/02/2021
7   15371/989/19    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=15371/989/19&offset=0  02/07/2019  PREFEITURA MUNICIPAL DE BIRIGUI     ADMISSAO DE PESSOAL - TEMPO DETERMINADO 2018    INTERESSADOS: ADRIANA PEREIRA CRISTAL E OUTROS. PROCESSOS SELETIVOS/EDITAIS Nºs:002/2016, 004/2017, 05/2017, 06/2017,001/2018 e 002/2018. LEIS AUTORIZADORAS: Nº 5134/2009 e Nº 3946/2001.  None    https://www2.tce.sp.gov.br/arqs_juri/pdf/804259.pdf 06/02/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/804262.pdf 06/02/2021
8   15388/989/20    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=15388/989/20&offset=0  04/06/2020  MARIA ANGELICA MIRANDA FERNANDES        RECURSO ORDINARIO   2018    Recurso Ordinário   RENATO MARTINS COSTA    https://www2.tce.sp.gov.br/arqs_juri/pdf/804259.pdf 06/02/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/804262.pdf 06/02/2021
9   12911/989/16    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=12911/989/16&offset=0  20/07/2016  MARCELO CANDIDO DE SOUZA    PREFEITURA MUNICIPAL DE SUZANO  RECURSO ORDINARIO   2016    Recurso Ordinário Ref. Atos de Admissão de Pessoal - Exercício 2012. objetivando o preenchimento temporário dos cargos de Médico Cardiologista 20h, Fotógrafo, Médico Clínico Geral 20lt, Médico Gineco DIMAS RAMALHO   https://www2.tce.sp.gov.br/arqs_juri/pdf/814599.pdf 27/04/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/814741.pdf 27/04/2021
10  1735/002/11 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=1735/002/11&offset=10  22/11/2011  FUNDACAO DE APOIO AOS HOSP VETERINARIOS DA UNESP        ADMISSAO DE PESSOAL - TEMPO DETERMINADO 2010    ADMISSAO DE PESSOAL POR TEMPO DETERMINADO COM CONCURSO/PROCESSO SELETIVO    ANTONIO ROQUE CITADINI  https://www2.tce.sp.gov.br/arqs_juri/pdf/800893.pdf 21/01/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/800969.pdf 21/01/2021
11  23494/989/18    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=23494/989/18&offset=10 20/11/2018  HAMILTON LUIS FOZ       RECURSO ORDINARIO   2018    Recurso Ordinário   DIMAS RAMALHO   https://www2.tce.sp.gov.br/arqs_juri/pdf/816918.pdf 13/05/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/817317.pdf 13/05/2021
12  24496/989/19    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=24496/989/19&offset=10 25/11/2019  PREFEITURA MUNICIPAL DE LORENA      RECURSO ORDINARIO   2017    Recurso Ordinário em face de sentença proferida nos autos de TC 00006265.989.19-4   DIMAS RAMALHO   https://www2.tce.sp.gov.br/arqs_juri/pdf/814660.pdf 27/04/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/814805.pdf 27/04/2021
13  17110/989/18    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=17110/989/18&offset=10 03/08/2018  JORGE ABISSAMRA PREFEITURA MUNICIPAL DE FERRAZ DE VASCONCELOS   RECURSO ORDINARIO   2018    Recurso Ordinário   DIMAS RAMALHO   https://www2.tce.sp.gov.br/arqs_juri/pdf/814633.pdf 27/04/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/814774.pdf 27/04/2021
14  24043/989/19    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=24043/989/19&offset=10 18/11/2019  PREFEITURA MUNICIPAL DE IRAPURU     RECURSO ORDINARIO   2018    Recurso ordinário   ROBSON MARINHO  https://www2.tce.sp.gov.br/arqs_juri/pdf/817014.pdf 12/05/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/817269.pdf 12/05/2021
15  2515/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=2515/989/20&offset=10  03/02/2020  PREFEITURA MUNICIPAL DE IPORANGA        RECURSO ORDINARIO   2020    Recurso interposto em face da sentença proferida nos autos do TC 15791/989/19-7.    ROBSON MARINHO  https://www2.tce.sp.gov.br/arqs_juri/pdf/817001.pdf 12/05/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/817267.pdf 12/05/2021
16  1891/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=1891/989/20&offset=10  24/01/2020  PREFEITURA MUNICIPAL DE IPORANGA        RECURSO ORDINARIO   2020    RECURSO ORDINÁRIO   DIMAS RAMALHO   https://www2.tce.sp.gov.br/arqs_juri/pdf/802484.pdf 03/02/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/802620.pdf 03/02/2021
17  15026/989/20    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=15026/989/20&offset=10 02/06/2020  DIXON RONAN CARVALHO    PREFEITURA MUNICIPAL DE PAULINIA    RECURSO ORDINARIO   2018    RECURSO ORDINÁRIO   ANTONIO ROQUE CITADINI  https://www2.tce.sp.gov.br/arqs_juri/pdf/802648.pdf 05/02/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/803361.pdf 05/02/2021
18  9070/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=9070/989/20&offset=10  09/03/2020  PREFEITURA MUNICIPAL DE FLORIDA PAULISTA        RECURSO ORDINARIO   2017    Recurso Ordinário   ROBSON MARINHO  https://www2.tce.sp.gov.br/arqs_juri/pdf/817006.pdf 12/05/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/817296.pdf 12/05/2021
19  21543/989/20    https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=21543/989/20&offset=10 11/09/2020  PREFEITURA MUNICIPAL DE JERIQUARA       RECURSO ORDINARIO   2020    RECURSO ORDINÁRIO   SIDNEY ESTANISLAU BERALDO   https://www2.tce.sp.gov.br/arqs_juri/pdf/802997.pdf 13/02/2021  https://www2.tce.sp.gov.br/arqs_juri/pdf/804511.pdf 13/02/2021
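
To cover all 41 result pages mentioned in the question instead of the first two, the offset would presumably run from 0 to 400 in steps of 10. Here is a minimal sketch of building those URLs with the standard library. The parameters are copied from the search URL above; `build_url` is a hypothetical helper, not part of the code above, and the server is assumed to accept standard percent-encoding of spaces, accents and slashes.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.tce.sp.gov.br/jurisprudencia/pesquisar"

# Query parameters copied from the search URL used above.
BASE_PARAMS = {
    "txtTdPalvs": "município pessoal 37",
    "txtExp": "temporari",
    "txtQqUma": "admissão contratação",
    "txtNenhPalvs": "",
    "txtNumIni": "",
    "txtNumFim": "",
    "tipoBuscaTxt": "Documento",
    "_tipoBuscaTxt": "on",
    "quantTrechos": "1",
    "processo": "",
    "exercicio": "",
    "dataAutuacaoInicio": "",
    "dataAutuacaoFim": "",
    "dataPubInicio": "01/01/2021",
    "dataPubFim": "31/12/2021",
    "_relator": "1",
    "_auditor": "1",
    "_materia": "1",
    "tipoDocumento": "2",
    "_tipoDocumento": "1",
    "acao": "Executa",
}


def build_url(offset):
    """Return the search URL for one results page (hypothetical helper)."""
    params = dict(BASE_PARAMS, offset=offset)
    return f"{SEARCH_URL}?{urlencode(params)}"


# 41 pages of 10 results each: offsets 0, 10, ..., 400.
page_urls = [build_url(offset) for offset in range(0, 410, 10)]
print(len(page_urls))
print(page_urls[-1])
```

Each entry of `page_urls` could then be passed to `s.get(...)` in place of the f-string inside the first loop, with the `range` in that loop widened accordingly.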

If you expect to need coding in your career, I strongly suggest building some foundational knowledge first, and only then writing your own code or adapting someone else's.
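
One such foundational point applies directly to the question: as posted, the adapted script defines `main()` but never calls it, which would explain why Colab shows neither an error message nor a CSV file. Below is a minimal sketch of the missing pieces using only the standard library. The two sample rows stand in for the dicts returned by `parse_result()`, the file name `results.csv` is a placeholder, and the scraping step itself is left out.

```python
import csv


def save_csv(rows, path):
    """Write a list of dicts to a CSV file, using the first row's keys as the header."""
    fieldnames = list(rows[0].keys())
    # newline="" avoids blank lines on Windows; utf-8 keeps accented characters intact.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def main(path="results.csv"):
    # Placeholder rows standing in for the dicts returned by parse_result().
    final = [
        {"Nº do Processo": "18955/989/20", "Relator": "EDGARD CAMARGO RODRIGUES"},
        {"Nº do Processo": "13614/989/18", "Relator": "ANTONIO ROQUE CITADINI"},
    ]
    save_csv(final, path)
    return final


# Without this call the functions above are only defined and nothing runs.
if __name__ == "__main__":
    main()
```

In a Colab cell, a bare `main()` on the last line has the same effect as the `if __name__ == "__main__":` guard.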
