My name is João. I'm a law student from Brazil and I'm new to this. I have been trying to scrape this page for a week, to help with my undergraduate thesis and with the work of other researchers.
I want to make a CSV file with all the results of a search on a court's website.
I'm working with Python on Google Colab, and I have tried many ways to scrape the page, but none of them worked well. My most complete attempt was an adaptation of a product-scraping tutorial: a video and the corresponding code on GitHub.
My adaptation does not work in Colab: it produces neither an error message nor a CSV file. By comparing the pages with the lesson, I identified some problems in my adaptation, which are marked in the code below:
While extracting the result links from each of the 41 pages, I believe I should build a list of the links to the result pages, but my code extracts the text too, and I'm not sure how to correct it.
While trying to extract the data from each result page, I fail. Whenever I tried to build a list of the fields, it returned only one result.
Beyond the tutorial, I would also like to extract data from the second table on each result page: the link to the oldest "relatório/voto" and its date, and the link to the oldest "acórdão" and its date. I'm not sure how, or at which point in the code, I should do that.
ADAPTED CODE
from requests_html import HTMLSession
import csv
s = HTMLSession()
# STEP 01: collect the links to the result pages
def get_results_links(page):
    url = f"https://www.tce.sp.gov.br/jurisprudencia/pesquisar?txtTdPalvs=município pessoal 37&txtExp=temporari&txtQqUma=admissão contratação&txtNenhPalvs=&txtNumIni=&txtNumFim=&tipoBuscaTxt=Documento&_tipoBuscaTxt=on&quantTrechos=1&processo=&exercicio=&dataAutuacaoInicio=&dataAutuacaoFim=&dataPubInicio=01/01/2021&dataPubFim=31/12/2021&_relator=1&_auditor=1&_materia=1&tipoDocumento=2&_tipoDocumento=1&acao=Executa&offset={page}"
    links = []
    r = s.get(url)
    results = r.html.find('td.small a')
    for item in results:
        links.append(item.find('a', first=True).attrs['href'])  # Problem 01: I believe this should create a list of the result links extracted from the page, but it extracts the text too.
    return links
# STEP 02: extract the relevant information from each result page collected above
def parse_result(url):
    r = s.get(url)
    numero = r.html.find('td.small', first=True).text.strip()
    data_autuacao = r.html.find('td.small', first=True).text.strip()
    try:
        parte_1 = r.html.find('td.small', first=True).text.strip()
    except AttributeError as err:
        sku = 'Não há'
    try:
        parte_2 = r.html.find('td.small', first=True).text.strip()
    except AttributeError as err:
        parte_2 = 'Não há'
    materia = r.html.find('td.small', first=True).text.strip()
    exercicio = r.html.find('td.small', first=True).text.strip()
    objeto = r.html.find('td.small', first=True).text.strip()
    relator = r.html.find('td.small', first=True).text.strip()
    # Problem 02
    # STEP 03: create a dict from the objects created above
    product = {
        'Nº do Processo': numero,
        "Link do Processo": r,
        'Data de Autuação': data_autuacao,
        'Parte 1': parte_1,
        'Parte 2': parte_2,
        'Exercício': exercicio,
        'Matéria': materia,
        'Objeto': objeto,
        'Relator': relator
        #'Relatório/Voto' :
        #'Data Relatório/Voto' :
        #'Acórdão' :
        #'Data Acórdão' :
    }  # Problem 03
    return product
# STEP 04: saving as csv
def save_csv(final):
    keys = final[0].keys()
    with open('products.csv', 'w') as f:
        dict_writer = csv.DictWriter(f, keys)
        dict_writer.writeheader()
        dict_writer.writerows(final)
# STEP 05: main - joining the functions
def main():
    final = []
    for x in range(0, 410, 10):
        print('Getting Page ', x)
        urls = get_results_links(x)
        for url in urls:
            final.append(parse_result(url))
        print('Total: ', len(final))
    save_csv(final)
Thank you, @shelter, for your help so far. I tried to make the question more specific.
CodePudding user response:
There are better (albeit more complex) ways of obtaining that information, such as Scrapy or an async solution. Nonetheless, here is one way of getting the information you're after and saving it to a CSV file. I only scraped the first 2 pages (20 results); you can increase the range if you wish:
from bs4 import BeautifulSoup as bs
import requests
from tqdm.notebook import tqdm
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
big_list = []
detailed_list = []
# Collect the links to the detail pages from each results page (offset 0, 10, ...).
for offset in tqdm(range(0, 20, 10)):
    url = f'https://www.tce.sp.gov.br/jurisprudencia/pesquisar?txtTdPalvs=município pessoal 37&txtExp=temporari&txtQqUma=admissão contratação&txtNenhPalvs=&txtNumIni=&txtNumFim=&tipoBuscaTxt=Documento&_tipoBuscaTxt=on&quantTrechos=1&processo=&exercicio=&dataAutuacaoInicio=&dataAutuacaoFim=&dataPubInicio=01/01/2021&dataPubFim=31/12/2021&_relator=1&_auditor=1&_materia=1&tipoDocumento=2&_tipoDocumento=1&acao=Executa&offset={offset}'
    r = s.get(url)
    # The attribute filter on `tr` did not survive in this post (`tr[]` is not a valid selector).
    # Assumption: keeping only links whose href starts with "exibir" (as in the result URLs) is equivalent.
    urls = bs(r.text, 'html.parser').select('tr td:nth-of-type(2) a[href^="exibir"]')
    big_list.extend(['https://www.tce.sp.gov.br/jurisprudencia/' + a.get('href') for a in urls])
# Visit each detail page; every field is located by the label of the cell next to it.
for x in tqdm(big_list):
    r = s.get(x)
    soup = bs(r.text, 'html.parser')
    n_proceso = soup.select_one('td:-soup-contains("N° Processo:")').find_next('td').text if soup.select('td:-soup-contains("N° Processo:")') else None
    link_proceso = x
    autoacao = soup.select_one('td:-soup-contains("Autuação:")').find_next('td').text if soup.select('td:-soup-contains("Autuação:")') else None
    parte_1 = soup.select_one('td:-soup-contains("Parte 1:")').find_next('td').text if soup.select('td:-soup-contains("Parte 1:")') else None
    parte_2 = soup.select_one('td:-soup-contains("Parte 2:")').find_next('td').text if soup.select('td:-soup-contains("Parte 2:")') else None
    materia = soup.select_one('td:-soup-contains("Matéria:")').find_next('td').text if soup.select('td:-soup-contains("Matéria:")') else None
    exercicio = soup.select_one('td:-soup-contains("Exercício:")').find_next('td').text if soup.select('td:-soup-contains("Exercício:")') else None
    objeto = soup.select_one('td:-soup-contains("Objeto:")').find_next('td').text if soup.select('td:-soup-contains("Objeto:")') else None
    relator = soup.select_one('td:-soup-contains("Relator:")').find_next('td').text if soup.select('td:-soup-contains("Relator:")') else None
    # In the second table the link and the date come before the label cell, hence find_previous.
    # The guard uses exactly the same selector as the lookup, so it cannot pass when the lookup would return None.
    relatorio_voto = soup.select_one('td:-soup-contains("Relatório / Voto ")').find_previous('a').get('href') if soup.select('td:-soup-contains("Relatório / Voto ")') else None
    data_relatorio = soup.select_one('td:-soup-contains("Relatório / Voto ")').find_previous('td').text if soup.select('td:-soup-contains("Relatório / Voto ")') else None
    acordao = soup.select_one('td:-soup-contains("Acórdão ")').find_previous('a').get('href') if soup.select('td:-soup-contains("Acórdão ")') else None
    data_acordao = soup.select_one('td:-soup-contains("Acórdão ")').find_previous('td').text if soup.select('td:-soup-contains("Acórdão ")') else None
    detailed_list.append((n_proceso, link_proceso, autoacao, parte_1, parte_2,
                          materia, exercicio, objeto, relator, relatorio_voto,
                          data_relatorio, acordao, data_acordao))
detailed_df = pd.DataFrame(detailed_list, columns=['n_proceso', 'link_proceso', 'autoacao', 'parte_1',
                                                   'parte_2', 'materia', 'exercicio', 'objeto', 'relator',
                                                   'relatorio_voto', 'data_relatorio', 'acordao', 'data_acordao'])
display(detailed_df)
detailed_df.to_csv('legal_br_stuffs.csv')
Result in terminal:
100%
2/2 [00:04<00:00, 1.78s/it]
100%
20/20 [00:07<00:00, 2.56it/s]
n_proceso link_proceso autoacao parte_1 parte_2 materia exercicio objeto relator relatorio_voto data_relatorio acordao data_acordao
0 18955/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=18955/989/20&offset=0 31/07/2020 ELVES SCIARRETTA CARREIRA PREFEITURA MUNICIPAL DE BRODOWSKI RECURSO ORDINARIO 2020 Recurso Ordinário Protocolado em anexo. EDGARD CAMARGO RODRIGUES https://www2.tce.sp.gov.br/arqs_juri/pdf/801385.pdf 20/01/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/801414.pdf 20/01/2021
1 13614/989/18 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=13614/989/18&offset=0 11/06/2018 PREFEITURA MUNICIPAL DE SERRA NEGRA RECURSO ORDINARIO 2014 Recurso Ordinário ANTONIO ROQUE CITADINI https://www2.tce.sp.gov.br/arqs_juri/pdf/797986.pdf 05/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/800941.pdf 05/02/2021
2 6269/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=6269/989/19&offset=0 19/02/2019 PREFEITURA MUNICIPAL DE TREMEMBE ADMISSAO DE PESSOAL - CONCURSO PROCESSO SELETIVO 2018 INTERESSADO: Rafael Varejão Munhos e outros. EDITAL Nº: 01/2017. CONCURSO PÚBLICO: 01/2017. None https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
3 14011/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14011/989/19&offset=0 11/06/2019 RUBENS EDUARDO DE SOUZA AROUCA PREFEITURA MUNICIPAL DE TREMEMBE RECURSO ORDINARIO 2019 Recurso Ordinário RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
4 14082/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14082/989/19&offset=0 12/06/2019 PREFEITURA MUNICIPAL DE TREMEMBE RECURSO ORDINARIO 2019 Recurso Ordinário nos autos do TC n° 6269.989.19 - Admissão de pessoal - Concurso Público RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
5 14238/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14238/989/19&offset=0 13/06/2019 MARCELO VAQUELI PREFEITURA MUNICIPAL DE TREMEMBE RECURSO ORDINARIO 2019 Recurso Ordinário RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804240.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804258.pdf 06/02/2021
6 14141/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=14141/989/20&offset=0 28/05/2020 PREFEITURA MUNICIPAL DE BIRIGUI CRISTIANO SALMEIRAO RECURSO ORDINARIO 2018 Recurso Ordinário RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804259.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804262.pdf 06/02/2021
7 15371/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=15371/989/19&offset=0 02/07/2019 PREFEITURA MUNICIPAL DE BIRIGUI ADMISSAO DE PESSOAL - TEMPO DETERMINADO 2018 INTERESSADOS: ADRIANA PEREIRA CRISTAL E OUTROS. PROCESSOS SELETIVOS/EDITAIS Nºs:002/2016, 004/2017, 05/2017, 06/2017,001/2018 e 002/2018. LEIS AUTORIZADORAS: Nº 5134/2009 e Nº 3946/2001. None https://www2.tce.sp.gov.br/arqs_juri/pdf/804259.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804262.pdf 06/02/2021
8 15388/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=15388/989/20&offset=0 04/06/2020 MARIA ANGELICA MIRANDA FERNANDES RECURSO ORDINARIO 2018 Recurso Ordinário RENATO MARTINS COSTA https://www2.tce.sp.gov.br/arqs_juri/pdf/804259.pdf 06/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804262.pdf 06/02/2021
9 12911/989/16 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=12911/989/16&offset=0 20/07/2016 MARCELO CANDIDO DE SOUZA PREFEITURA MUNICIPAL DE SUZANO RECURSO ORDINARIO 2016 Recurso Ordinário Ref. Atos de Admissão de Pessoal - Exercício 2012. objetivando o preenchimento temporário dos cargos de Médico Cardiologista 20h, Fotógrafo, Médico Clínico Geral 20lt, Médico Gineco DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/814599.pdf 27/04/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/814741.pdf 27/04/2021
10 1735/002/11 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=1735/002/11&offset=10 22/11/2011 FUNDACAO DE APOIO AOS HOSP VETERINARIOS DA UNESP ADMISSAO DE PESSOAL - TEMPO DETERMINADO 2010 ADMISSAO DE PESSOAL POR TEMPO DETERMINADO COM CONCURSO/PROCESSO SELETIVO ANTONIO ROQUE CITADINI https://www2.tce.sp.gov.br/arqs_juri/pdf/800893.pdf 21/01/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/800969.pdf 21/01/2021
11 23494/989/18 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=23494/989/18&offset=10 20/11/2018 HAMILTON LUIS FOZ RECURSO ORDINARIO 2018 Recurso Ordinário DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/816918.pdf 13/05/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/817317.pdf 13/05/2021
12 24496/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=24496/989/19&offset=10 25/11/2019 PREFEITURA MUNICIPAL DE LORENA RECURSO ORDINARIO 2017 Recurso Ordinário em face de sentença proferida nos autos de TC 00006265.989.19-4 DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/814660.pdf 27/04/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/814805.pdf 27/04/2021
13 17110/989/18 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=17110/989/18&offset=10 03/08/2018 JORGE ABISSAMRA PREFEITURA MUNICIPAL DE FERRAZ DE VASCONCELOS RECURSO ORDINARIO 2018 Recurso Ordinário DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/814633.pdf 27/04/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/814774.pdf 27/04/2021
14 24043/989/19 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=24043/989/19&offset=10 18/11/2019 PREFEITURA MUNICIPAL DE IRAPURU RECURSO ORDINARIO 2018 Recurso ordinário ROBSON MARINHO https://www2.tce.sp.gov.br/arqs_juri/pdf/817014.pdf 12/05/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/817269.pdf 12/05/2021
15 2515/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=2515/989/20&offset=10 03/02/2020 PREFEITURA MUNICIPAL DE IPORANGA RECURSO ORDINARIO 2020 Recurso interposto em face da sentença proferida nos autos do TC 15791/989/19-7. ROBSON MARINHO https://www2.tce.sp.gov.br/arqs_juri/pdf/817001.pdf 12/05/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/817267.pdf 12/05/2021
16 1891/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=1891/989/20&offset=10 24/01/2020 PREFEITURA MUNICIPAL DE IPORANGA RECURSO ORDINARIO 2020 RECURSO ORDINÁRIO DIMAS RAMALHO https://www2.tce.sp.gov.br/arqs_juri/pdf/802484.pdf 03/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/802620.pdf 03/02/2021
17 15026/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=15026/989/20&offset=10 02/06/2020 DIXON RONAN CARVALHO PREFEITURA MUNICIPAL DE PAULINIA RECURSO ORDINARIO 2018 RECURSO ORDINÁRIO ANTONIO ROQUE CITADINI https://www2.tce.sp.gov.br/arqs_juri/pdf/802648.pdf 05/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/803361.pdf 05/02/2021
18 9070/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=9070/989/20&offset=10 09/03/2020 PREFEITURA MUNICIPAL DE FLORIDA PAULISTA RECURSO ORDINARIO 2017 Recurso Ordinário ROBSON MARINHO https://www2.tce.sp.gov.br/arqs_juri/pdf/817006.pdf 12/05/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/817296.pdf 12/05/2021
19 21543/989/20 https://www.tce.sp.gov.br/jurisprudencia/exibir?proc=21543/989/20&offset=10 11/09/2020 PREFEITURA MUNICIPAL DE JERIQUARA RECURSO ORDINARIO 2020 RECURSO ORDINÁRIO SIDNEY ESTANISLAU BERALDO https://www2.tce.sp.gov.br/arqs_juri/pdf/802997.pdf 13/02/2021 https://www2.tce.sp.gov.br/arqs_juri/pdf/804511.pdf 13/02/2021
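To go beyond the first two pages, the offsets for all pages can be generated up front. The sketch below assumes the figures mentioned in the question (41 pages, 10 results per page) and uses only the standard library. The names `BASE_URL`, `SEARCH_PARAMS`, `page_offsets` and `build_search_url` are made up for illustration. It also percent-encodes the spaces, slashes and accented characters in the query, which `requests` would otherwise have to handle for you; whether the site treats both forms identically is an assumption worth checking on one page first.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.tce.sp.gov.br/jurisprudencia/pesquisar"

# Query parameters copied from the search URL used in the code above.
SEARCH_PARAMS = {
    "txtTdPalvs": "município pessoal 37",
    "txtExp": "temporari",
    "txtQqUma": "admissão contratação",
    "txtNenhPalvs": "",
    "txtNumIni": "",
    "txtNumFim": "",
    "tipoBuscaTxt": "Documento",
    "_tipoBuscaTxt": "on",
    "quantTrechos": "1",
    "processo": "",
    "exercicio": "",
    "dataAutuacaoInicio": "",
    "dataAutuacaoFim": "",
    "dataPubInicio": "01/01/2021",
    "dataPubFim": "31/12/2021",
    "_relator": "1",
    "_auditor": "1",
    "_materia": "1",
    "tipoDocumento": "2",
    "_tipoDocumento": "1",
    "acao": "Executa",
}


def page_offsets(total_pages, page_size=10):
    """Return the offset of every results page: 0, 10, 20, ..."""
    return list(range(0, total_pages * page_size, page_size))


def build_search_url(offset):
    """Return the search URL for one page, with the query percent-encoded."""
    params = {**SEARCH_PARAMS, "offset": offset}
    return f"{BASE_URL}?{urlencode(params)}"


offsets = page_offsets(41)  # 41 pages is the count given in the question
print(len(offsets), offsets[0], offsets[-1])  # 41 0 400
print(build_search_url(offsets[-1]))
```

In the code above, this would amount to replacing `range(0, 20, 10)` with `range(0, 410, 10)`. Fetching roughly 410 detail pages takes a while, so a short pause between requests is a reasonable courtesy to the server.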
If you will need to code in your career, I strongly suggest building some foundational knowledge first, and only then writing or adapting other people's code.
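On the CSV step from the question (STEP 04), here is a minimal standard-library sketch for anyone who prefers not to use pandas. The file name `resultados.csv` and the three-column rows are illustrative only, with values taken from the first two results above. The `newline=""` argument is what the `csv` module documentation recommends when opening the file, and `utf-8-sig` is an assumption that the file will be opened in a spreadsheet program, where the byte-order mark usually helps accented characters display correctly.

```python
import csv


def save_csv(rows, path):
    """Write a list of dicts to `path`, using the first row's keys as the header."""
    if not rows:
        return  # nothing to write; also avoids an IndexError on rows[0]
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Sample rows with a subset of the columns shown in the result table.
rows = [
    {"n_proceso": "18955/989/20", "materia": "RECURSO ORDINARIO", "exercicio": "2020"},
    {"n_proceso": "13614/989/18", "materia": "RECURSO ORDINARIO", "exercicio": "2014"},
]
save_csv(rows, "resultados.csv")
```

Also worth checking: the code in the question defines `main()` but, as posted, never calls it. That alone would give neither an error message nor a CSV file, so a final line with `main()` is needed for anything to run.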