Python JSON web scraping code exception error occurs with loop but not when no loop is run

Time:10-04

I am trying to scrape data on political contributions in Boston from the OCPF website.

There are too many results to export, and I am learning Python, so I wrote a script to scrape the data.

Here is my code:

# Set up---------
# load libraries
import os
import json
import requests
import pandas as pd
from pprintpp import pprint as pp
from pandas import json_normalize

# Data ingestion prep----------------
# there are just under 3.9 million observations in the data according to the OCPF website
# create a counter of numbers to generate urls
a = list(range(100, 391900, 100))

# convert the list from numeric to string values for ease of concatenation
a = [str(x) for x in a]

# Pull data------------

# create an empty object
url = []

# for each item in our list of numbers create a url
for number in a:
    url.append('https://www.ocpf.us/ReportData/GetItemsAndSummary?pageSize=100&currentIndex=' + number + '&sortField=date&sortDirection=DESC&searchTypeCategory=A&recordTypeId=201&cityCode=35&startDate=01/01/2011&filerCpfId=0')

# call the urls
results = [requests.get(u) for u in url]

# decode each response body as JSON
results_decode = map(lambda x: x.json(),results)

# output to a dataframe
df_contrib = json_normalize(results_decode,'items')

When I run this script in a Jupyter notebook or from the command line, it either runs forever or eventually fails with the following error:

Exception Value: Expecting value: line 1 column 1 (char 0)

Others have noted that this is likely because I am requesting URLs that don't exist. However, when I run the script on a single entry in the list of URLs (for example, the very last one), it works.

Any advice is welcome. I also don't know if this is the best way to get the data I am looking for (i.e., individual political contributions in Boston starting from 01/01/2011).

Answer:

You have overcomplicated things somewhat. Here is one way to obtain that data:

import requests
import pandas as pd
from tqdm.notebook import tqdm

headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
    }
s = requests.Session()
s.headers.update(headers)
big_df = pd.DataFrame()
for x in tqdm(range(1, 1000, 100)):
    url = f'https://www.ocpf.us/ReportData/GetItemsAndSummary?pageSize=100&currentIndex={x}&sortField=date&sortDirection=DESC&searchTypeCategory=A&recordTypeId=201&cityCode=35&startDate=01/01/2011&filerCpfId=0'
    r = s.get(url)
    df = pd.json_normalize(r.json()['items'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
display(big_df)

Result in terminal:

RecordTypeId    cityStateZip    fullNameReverse fullNameReverseAddress  firstName   lastName    occupation  employer    principalOfficer    contributorCpfId    description SourceLink  tenderTypeId    tenderTypeDescription   isPreviousYearContribution  isTransfer  linkedRefundGuid    id  reportId    filerCpfId  filerFullNameReverse    recordTypeId    recordTypeDescription   streetAddress   city    state   zipCode date    amount  sourceLink  sourceDescription
0   201 Boston, MA 02127    McGahan, John P <b>McGahan, John P</b><br>5 Gates Street    John P  McGahan Councilor   Gavin Foundation        0       <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=851184'>10/3/22 Deposit Report</a>    1   Check   False   False   None    4936666 851184  13899   Koch, Thomas P. 201 Individual  5 Gates Street  Boston  MA  02127   10/3/2022   $500.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=851184'>10/3/22 Deposit Report</a>    10/3/22 Deposit Report
1   201 Boston, MA 02125    Johnson, Lucas  <b>Johnson, Lucas</b><br>49 Pleasant Street, Apt 2  Lucas   Johnson             0       <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850935'>10/2/22 Deposit Report</a>    3   Credit Card False   False   None    4935637 850935  15466   Heroux, Paul    201 Individual  49 Pleasant Street, Apt 2   Boston  MA  02125   10/2/2022   $25.00  <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850935'>10/2/22 Deposit Report</a>    10/2/22 Deposit Report
2   201 Boston, MA 02116    Flynn, Jean <b>Flynn, Jean</b><br>82 Commonwealth Ave   Jean    Flynn               0       <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850905'>10/1/22 Deposit Report</a>    1   Check   False   False   None    4935375 850905  16539   Bezanson, Alex  201 Individual  82 Commonwealth Ave Boston  MA  02116   10/1/2022   $50.00  <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850905'>10/1/22 Deposit Report</a>    10/1/22 Deposit Report
3   201 West Roxbury, MA 02132  Goldstein, Madison  <b>Goldstein, Madison</b><br>1435 Centre Street Madison Goldstein               0       <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850809'>10/1/22 Deposit Report</a>    3   Credit Card False   False   None    4935249 850809  15466   Heroux, Paul    201 Individual  1435 Centre Street  West Roxbury    MA  02132   10/1/2022   $25.00  <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850809'>10/1/22 Deposit Report</a>    10/1/22 Deposit Report
4   201 Boston, MA 02118    Aertsen, Guilliaem  <b>Aertsen, Guilliaem</b><br>175 West Brookline Street  Guilliaem   Aertsen Real estate Aertsen Ventures        0       <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850676'>9/30/22 Deposit Report</a>    3   Credit Card False   False   None    4934044 850676  14907   Diehl, Geoffrey 201 Individual  175 West Brookline Street   Boston  MA  02118   9/30/2022   $250.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850676'>9/30/22 Deposit Report</a>    9/30/22 Deposit Report
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 201 South Boston, MA 02127-1190 Beamer, Charles Michael <b>Beamer, Charles Michael</b><br>45 W 3rd St , Apt 420 Charles Michael Beamer  Marketing Director  ICF     0       <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a>    3   Credit Card False   False   None    4903240 849069  15710   Healey, Maura T.    201 Individual  45 W 3rd St , Apt 420   South Boston    MA  02127-1190  9/12/2022   $1,000.00   <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a>    9/12/22 Deposit Report
996 201 Brighton, MA 02135-4608 Bearak, Joseph  <b>Bearak, Joseph</b><br>168 Chiswick Rd ,  Joseph  Bearak  consultant  Joseph Bearak       0       <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a>    3   Credit Card False   False   None    4903241 849069  15710   Healey, Maura T.    201 Individual  168 Chiswick Rd ,   Brighton    MA  02135-4608  9/12/2022   $100.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a>    9/12/22 Deposit Report
997 201 Jamaica Plain, MA 02130-4613    Bellarose, Jessica  <b>Bellarose, Jessica</b><br>38 Wayburn Rd ,    Jessica Bellarose   Pilates Instructor  Boston Pilates      0       <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a>    3   Credit Card False   False   None    4903246 849069  15710   Healey, Maura T.    201 Individual  38 Wayburn Rd , Jamaica Plain   MA  02130-4613  9/12/2022   $25.00  <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a>    9/12/22 Deposit Report
998 201 Charlestown, MA 02129-2542  Benson, John R  <b>Benson, John R</b><br>26 Cedar St ,  John R  Benson  Not Employed    Not Employed        0       <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a>    3   Credit Card False   False   None    4903249 849069  15710   Healey, Maura T.    201 Individual  26 Cedar St ,   Charlestown MA  02129-2542  9/12/2022   $5.00   <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a>    9/12/22 Deposit Report
999 201 Boston, MA 02111    Bergstresser, Clyde <b>Bergstresser, Clyde</b><br>52 Temple Place   Clyde   Bergstresser    Attorney    Bergstresser & Pollock LLC      0       <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849876'>9/12/22 Deposit Report</a>    3   Credit Card False   False   None    4924644 849876  80079   Lawyers for Action Pol Action Comm  201 Individual  52 Temple Place Boston  MA  02111   9/12/2022   $40.00  <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849876'>9/12/22 Deposit Report</a>    9/12/22 Deposit Report

I stopped at 1,000 results; you can run it over the full range.
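
To go past the first 1,000 rows without hard-coding the total, one option is to keep requesting pages until one comes back empty. The sketch below is only an illustration: collect_items and fetch_page are hypothetical names, and it assumes the endpoint returns an empty 'items' list once currentIndex runs past the last record, which I have not verified.

```python
def collect_items(fetch_page, start=1, page_size=100, max_pages=None):
    """Call fetch_page(index) for successive indexes until a page is empty.

    fetch_page is a hypothetical callable: it takes a currentIndex value
    and returns the list of records for that page.
    """
    rows = []
    index = start
    pages = 0
    # max_pages is an optional safety cap; None means "until an empty page"
    while max_pages is None or pages < max_pages:
        items = fetch_page(index)
        if not items:
            # assumption: an empty page means there is nothing left to fetch
            break
        rows.extend(items)
        index += page_size
        pages += 1
    return rows
```

With the session above, fetch_page could be a small function that calls s.get(...) on the URL for that index (ideally with a timeout) and returns r.json()['items']. Passing the combined list to pd.json_normalize once also avoids growing the DataFrame inside the loop.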

For TQDM visit https://pypi.org/project/tqdm/

For Requests documentation, see https://requests.readthedocs.io/en/latest/

Also for pandas: https://pandas.pydata.org/pandas-docs/stable/index.html
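
On the original error: "Expecting value: line 1 column 1 (char 0)" is the message Python's JSON decoder gives when the text it receives is empty or does not start with a JSON value, for example an HTML error page. Which of those the server sent in your case is a guess on my part. A small sketch for finding out (parse_items is a hypothetical helper, and it assumes the payload is an object with an 'items' key, as in the code above):

```python
import json


def parse_items(body_text):
    """Return the 'items' list from a response body, or None if it is not JSON."""
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError as err:
        # An empty body or an HTML page lands here; show the start of the
        # body so it is clear what the server actually sent back.
        print(f"not JSON ({err}); body starts with {body_text[:60]!r}")
        return None
    # assumption: the payload is a JSON object that may contain 'items'
    return payload.get("items", [])
```

Calling parse_items(r.text) for each response, together with a look at r.status_code, should show which request in the loop returns something other than JSON.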
