I've been trying to fetch all US ZIP codes for a web scraping project at my company. I'd like to use the uszipcode library to get them automatically rather than scraping them manually from the website I'm interested in, but I can't figure out how.
This is my manual attempt:
from bs4 import BeautifulSoup
import requests

url = 'https://www.unitedstateszipcodes.org'
headers = {'User-Agent': 'Chrome/50.0.2661.102'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
hrefs = []
all_zipcodes = []

# Extract the link to each state's page, skipping anchors without an href
for data in soup.find_all('div', class_='state-list'):
    for a in data.find_all('a'):
        href = a.get('href')
        if href is not None:
            hrefs.append(href)
def get_zipcode_list():
    """
    get_zipcode_list visits every state page and collects the ZIP codes listed there
    :return: list of ZIP code strings.
    """
    for state in hrefs:
        state_url = url + state
        state_page = requests.get(state_url, headers=headers)
        states_soup = BeautifulSoup(state_page.text, 'html.parser')
        div = states_soup.find(class_='list-group')
        if div is None:
            continue  # no ZIP code list on this page
        for a in div.find_all('a'):
            if str(a.string).isdigit():
                all_zipcodes.append(str(a.string))
    return all_zipcodes
This takes a lot of time, and I would like to know how to do the same thing more efficiently using uszipcode.
CodePudding user response:
You can search with the empty pattern '', which matches every ZIP code:
from uszipcode import SearchEngine
s = SearchEngine()
l = s.by_pattern('', returns=1000000)
print(len(l))
More details are in the docs and in their basic tutorial.
CodePudding user response:
US ZIP codes match the regex [0-9]{5}(?:-[0-9]{4})?
You can check a value with the re module:
import re
regex = r"[0-9]{5}(?:-[0-9]{4})?"
zipcode = "12345-6789"  # example value to check
if re.match(regex, zipcode):  # pattern first, then the string
    print("match")
else:
    print("not a match")
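Because re.match only anchors at the start of the string, a stricter variant can use re.fullmatch so that inputs like "123456" are rejected. A minimal sketch; the helper name is_us_zipcode is made up for illustration, and it only checks the shape of the string, not whether the ZIP code actually exists:

```python
import re

# Five digits, optionally followed by a hyphen and four more digits (ZIP+4).
ZIP_RE = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")


def is_us_zipcode(candidate: str) -> bool:
    """Return True only if the whole string is shaped like a US ZIP code."""
    return ZIP_RE.fullmatch(candidate) is not None


for value in ["10001", "10001-1234", "123456", "1000"]:
    print(value, is_us_zipcode(value))
```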
CodePudding user response:
If this is for one-time use and you don't need the extra metadata that uszipcode provides for each ZIP code, you can download the list of ZIP codes as a CSV from the official source (42k rows) and parse it.
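Parsing such a CSV might look roughly like the sketch below. The column names "zip" and "city" and the two sample rows are assumptions for illustration, not the real file's layout, so adjust the column argument to whatever header the downloaded file uses. Reading the values as strings keeps leading zeros intact.

```python
import csv
import io

# Stand-in for the downloaded file; the "zip" and "city" column names and the
# two rows are illustrative assumptions, not the real file's layout.
SAMPLE_CSV = """zip,city
00501,Holtsville
10001,New York
"""


def read_zipcodes(csv_file, column="zip"):
    """Collect one column of a CSV as strings so leading zeros are preserved."""
    reader = csv.DictReader(csv_file)
    return [row[column].strip() for row in reader if row.get(column)]


zipcodes = read_zipcodes(io.StringIO(SAMPLE_CSV))
print(zipcodes)
```

For a real file, pass an open file handle instead of the StringIO object, for example one opened with open(path, newline="").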
uszipcode also has another, much larger database that should have all the data you need.
from uszipcode import SearchEngine
# simple_zipcode=False selects the larger database instead of the simple one
zipSearch = SearchEngine(simple_zipcode=False)
allZipCodes = zipSearch.by_pattern('', returns=200000)
print(len(allZipCodes))
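by_pattern returns result objects rather than plain strings. A minimal sketch of collecting just the codes, assuming each result exposes a zipcode attribute; the helper name zipcode_strings and the SimpleNamespace stand-ins are for illustration only, so the snippet runs without the library or its database:

```python
from types import SimpleNamespace


def zipcode_strings(results):
    """Return the sorted, de-duplicated zipcode attribute of each result."""
    return sorted({r.zipcode for r in results})


# Stand-ins for the objects returned by SearchEngine.by_pattern(...)
fake_results = [
    SimpleNamespace(zipcode="10001"),
    SimpleNamespace(zipcode="00501"),
    SimpleNamespace(zipcode="10001"),
]
print(zipcode_strings(fake_results))
```

With the real library, the same helper would be called on the list returned by the search.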