I am trying to scrape data on political contributions in Boston the
There are too many results to export, and I am learning Python, so I wrote a script to scrape the data.
Here is my code:
# Set up---------
# load libraries
import os
import json
import requests
import pandas as pd
from pprintpp import pprint as pp
from pandas import json_normalize
# Data ingestion prep----------------
# there are just under 3.9 million observations in the data according to the OCPF website
# create a counter of numbers to generate urls
a = list(range(100, 391900, 100))
# convert the list from numeric to string values for ease of concatenation
a = [str(x) for x in a]
# Pull data------------
# create an empty object
url = []
# for each item in our list of numbers create a url
for number in a:
    url.append('https://www.ocpf.us/ReportData/GetItemsAndSummary?pageSize=100&currentIndex=' + number + '&sortField=date&sortDirection=DESC&searchTypeCategory=A&recordTypeId=201&cityCode=35&startDate=01/01/2011&filerCpfId=0')
# call the urls
results = [requests.get(u) for u in url]
# write to structured json files
results_decode = map(lambda x: x.json(),results)
# output to a dataframe
df_contrib = json_normalize(results_decode,'items')
When I run this script in a Jupyter notebook or from the command line, it either runs forever or eventually fails with the following error:
Exception Value: Expecting value: line 1 column 1 (char 0)
Others have noted that this is likely because I am trying to reference files that don't exist. However, when I just try to run the script on a single entry in the list of urls, e.g., the very last one, the script works.
Any advice is welcome. I also don't know if this is the best way to get the data I am looking for (i.e., individual political contributions in Boston starting from 01/01/2011).
Answer:
You have overcomplicated things somewhat. Here is one way to obtain the data:
import requests
import pandas as pd
from tqdm.notebook import tqdm
# Send a browser-like User-Agent with every request
headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}
# Reuse one session so the headers and the connection are shared across requests
s = requests.Session()
s.headers.update(headers)
big_df = pd.DataFrame()
# Request 10 pages of 100 records each (currentIndex = 1, 101, ..., 901)
for x in tqdm(range(1, 1000, 100)):
    url = f'https://www.ocpf.us/ReportData/GetItemsAndSummary?pageSize=100&currentIndex={x}&sortField=date&sortDirection=DESC&searchTypeCategory=A&recordTypeId=201&cityCode=35&startDate=01/01/2011&filerCpfId=0'
    r = s.get(url)
    # The records are under the 'items' key of the JSON response
    df = pd.json_normalize(r.json()['items'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
# display() is available in Jupyter notebooks
display(big_df)
Result in terminal:
RecordTypeId cityStateZip fullNameReverse fullNameReverseAddress firstName lastName occupation employer principalOfficer contributorCpfId description SourceLink tenderTypeId tenderTypeDescription isPreviousYearContribution isTransfer linkedRefundGuid id reportId filerCpfId filerFullNameReverse recordTypeId recordTypeDescription streetAddress city state zipCode date amount sourceLink sourceDescription
0 201 Boston, MA 02127 McGahan, John P <b>McGahan, John P</b><br>5 Gates Street John P McGahan Councilor Gavin Foundation 0 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=851184'>10/3/22 Deposit Report</a> 1 Check False False None 4936666 851184 13899 Koch, Thomas P. 201 Individual 5 Gates Street Boston MA 02127 10/3/2022 $500.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=851184'>10/3/22 Deposit Report</a> 10/3/22 Deposit Report
1 201 Boston, MA 02125 Johnson, Lucas <b>Johnson, Lucas</b><br>49 Pleasant Street, Apt 2 Lucas Johnson 0 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850935'>10/2/22 Deposit Report</a> 3 Credit Card False False None 4935637 850935 15466 Heroux, Paul 201 Individual 49 Pleasant Street, Apt 2 Boston MA 02125 10/2/2022 $25.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850935'>10/2/22 Deposit Report</a> 10/2/22 Deposit Report
2 201 Boston, MA 02116 Flynn, Jean <b>Flynn, Jean</b><br>82 Commonwealth Ave Jean Flynn 0 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850905'>10/1/22 Deposit Report</a> 1 Check False False None 4935375 850905 16539 Bezanson, Alex 201 Individual 82 Commonwealth Ave Boston MA 02116 10/1/2022 $50.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850905'>10/1/22 Deposit Report</a> 10/1/22 Deposit Report
3 201 West Roxbury, MA 02132 Goldstein, Madison <b>Goldstein, Madison</b><br>1435 Centre Street Madison Goldstein 0 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850809'>10/1/22 Deposit Report</a> 3 Credit Card False False None 4935249 850809 15466 Heroux, Paul 201 Individual 1435 Centre Street West Roxbury MA 02132 10/1/2022 $25.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850809'>10/1/22 Deposit Report</a> 10/1/22 Deposit Report
4 201 Boston, MA 02118 Aertsen, Guilliaem <b>Aertsen, Guilliaem</b><br>175 West Brookline Street Guilliaem Aertsen Real estate Aertsen Ventures 0 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850676'>9/30/22 Deposit Report</a> 3 Credit Card False False None 4934044 850676 14907 Diehl, Geoffrey 201 Individual 175 West Brookline Street Boston MA 02118 9/30/2022 $250.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=850676'>9/30/22 Deposit Report</a> 9/30/22 Deposit Report
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 201 South Boston, MA 02127-1190 Beamer, Charles Michael <b>Beamer, Charles Michael</b><br>45 W 3rd St , Apt 420 Charles Michael Beamer Marketing Director ICF 0 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a> 3 Credit Card False False None 4903240 849069 15710 Healey, Maura T. 201 Individual 45 W 3rd St , Apt 420 South Boston MA 02127-1190 9/12/2022 $1,000.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a> 9/12/22 Deposit Report
996 201 Brighton, MA 02135-4608 Bearak, Joseph <b>Bearak, Joseph</b><br>168 Chiswick Rd , Joseph Bearak consultant Joseph Bearak 0 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a> 3 Credit Card False False None 4903241 849069 15710 Healey, Maura T. 201 Individual 168 Chiswick Rd , Brighton MA 02135-4608 9/12/2022 $100.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a> 9/12/22 Deposit Report
997 201 Jamaica Plain, MA 02130-4613 Bellarose, Jessica <b>Bellarose, Jessica</b><br>38 Wayburn Rd , Jessica Bellarose Pilates Instructor Boston Pilates 0 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a> 3 Credit Card False False None 4903246 849069 15710 Healey, Maura T. 201 Individual 38 Wayburn Rd , Jamaica Plain MA 02130-4613 9/12/2022 $25.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a> 9/12/22 Deposit Report
998 201 Charlestown, MA 02129-2542 Benson, John R <b>Benson, John R</b><br>26 Cedar St , John R Benson Not Employed Not Employed 0 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a> 3 Credit Card False False None 4903249 849069 15710 Healey, Maura T. 201 Individual 26 Cedar St , Charlestown MA 02129-2542 9/12/2022 $5.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849069'>9/12/22 Deposit Report</a> 9/12/22 Deposit Report
999 201 Boston, MA 02111 Bergstresser, Clyde <b>Bergstresser, Clyde</b><br>52 Temple Place Clyde Bergstresser Attorney Bergstresser & Pollock LLC 0 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849876'>9/12/22 Deposit Report</a> 3 Credit Card False False None 4924644 849876 80079 Lawyers for Action Pol Action Comm 201 Individual 52 Temple Place Boston MA 02111 9/12/2022 $40.00 <a target='_blank' href='https://www.ocpf.us/Reports/DisplayReport?menuHidden=true&id=849876'>9/12/22 Deposit Report</a> 9/12/22 Deposit Report
I stopped at 1,000 results; you can run it over the full range.
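For the full range, a loop that stops by itself is probably safer than a hard-coded upper bound. The following is only a sketch under some assumptions: the helper names `fetch_items` and `scrape` and the `pause` and `max_pages` parameters are made up for illustration, and it assumes the server returns an empty `items` list once `currentIndex` runs past the last record, which I have not verified. It collects the pages in a list and concatenates once at the end, which avoids re-copying the growing frame on every iteration.

```python
import time

import pandas as pd
import requests

# Same query as above; only currentIndex changes between requests.
URL = (
    "https://www.ocpf.us/ReportData/GetItemsAndSummary"
    "?pageSize=100&currentIndex={index}&sortField=date&sortDirection=DESC"
    "&searchTypeCategory=A&recordTypeId=201&cityCode=35"
    "&startDate=01/01/2011&filerCpfId=0"
)


def fetch_items(session, index):
    """Return the 'items' list for one page, or None if the reply is unusable."""
    r = session.get(URL.format(index=index), timeout=30)
    if r.status_code != 200:
        return None
    try:
        return r.json().get("items", [])
    except ValueError:  # raised when the body is empty or is not JSON
        return None


def scrape(session, start=1, page_size=100, max_pages=None, pause=0.5):
    """Collect pages until one is empty or unusable (hypothetical helper)."""
    frames = []
    index = start  # the loop above starts at 1, so this sketch does too
    pages = 0
    while max_pages is None or pages < max_pages:
        items = fetch_items(session, index)
        if not items:  # None (bad reply) or [] (assumed: no more records)
            break
        frames.append(pd.json_normalize(items))
        index += page_size
        pages += 1
        time.sleep(pause)  # be polite to the server
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


if __name__ == "__main__":
    s = requests.Session()
    s.headers.update({"User-Agent": "Mozilla/5.0"})
    df = scrape(s, max_pages=3)  # small trial run before removing the cap
    print(len(df))
```

Note that this sketch also stops on the first bad reply and keeps what it has collected so far, so it is worth comparing the row count with the total the site reports before trusting the result.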
For tqdm, see https://pypi.org/project/tqdm/
For the Requests documentation, see https://requests.readthedocs.io/en/latest/
For the pandas documentation, see https://pandas.pydata.org/pandas-docs/stable/index.html
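As for the error in the question: "Expecting value: line 1 column 1 (char 0)" is what Python's JSON decoder reports when the text it is given is empty or does not start with JSON at all, for example an HTML error page. It does not necessarily mean the URL does not exist. Why the server would send such a body here (a missing User-Agent, too many rapid requests, or something else) is a guess on my part. The small helper below, `json_problem`, is a made-up name used only to illustrate the message:

```python
import json


def json_problem(body):
    """Return None if body parses as JSON, otherwise the decoder's message."""
    try:
        json.loads(body)
    except json.JSONDecodeError as err:
        return str(err)
    return None


print(json_problem('{"items": []}'))      # None
print(json_problem(""))                   # Expecting value: line 1 column 1 (char 0)
print(json_problem("<html>busy</html>"))  # Expecting value: line 1 column 1 (char 0)
```

When a request fails this way, printing `r.status_code` and the first part of `r.text` usually shows what the server actually sent.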