Cleaning up web-scraped data and combining it together?

Time:06-02

The website URL is https://www.justia.com/lawyers/criminal-law/maine

I want to scrape only each lawyer's name and office address.

import requests
from bs4 import BeautifulSoup

url = "https://www.justia.com/lawyers/criminal-law/maine"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

Lawyer_name = soup.find_all("a", "url main-profile-link")
for i in Lawyer_name:
    print(i.find(text=True))

address = soup.find_all("span", "-address -hide-landscape-tablet")
for x in address:
    print(x.find_all(text=True))

The names print out just fine, but the addresses print with extra whitespace that I want to remove:

['\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t88 Hammond Street', '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBangor,\t\t\t\t\tME 04401\t\t\t\t\t\t    ']

The output I'm trying to get for each lawyer looks like this (using the first one as an example):

Hunter J Tzovarras
88 Hammond Street
Bangor, ME 04401

Two issues I'm trying to figure out:

  1. How can I clean up the address so it is easier to read?
  2. How can I save each lawyer's name with the matching address so they don't get mixed up?

CodePudding user response:

For your second question, you can save them in a dictionary like this:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# parse all names and save them in a list
lawyer_names = soup.find_all("a", "url main-profile-link")
lawyer_names = [name.find(text=True).strip() for name in lawyer_names]

# parse all addresses, collapse runs of whitespace, and save them in a list
lawyer_addresses = soup.find_all("span", "-address -hide-landscape-tablet")
lawyer_addresses = [re.sub(r'\s+', ' ', address.get_text(strip=True)) for address in lawyer_addresses]

# map names to addresses
lawyer_dict = dict(zip(lawyer_names, lawyer_addresses))

print(lawyer_dict)

Output dictionary:

{'Albert Hansen': '62 Portland Rd., Ste. 44Kennebunk, ME 04043',
 'Amber Lynn Tucker': '415 Congress St., Ste. 202P.O. Box 7542Portland, ME 04112',
 'Amy Fairfield': '10 Stoney Brook LaneLyman, ME 04002',
 'Andrews Bruce Campbell Esq': '919 Ridge RoadP.O. BOX 119Bowdoinham, ME 04008',
 'Bradford Pattershall Esq': 'Two Canal PlazaPO Box 4600Portland, ME 04112',
 'Christopher Causey Esq': '949 Main StreetSanford, ME 04073',
 'Cory McKenna': '75 Pearl St.Suite 216Portland, ME 04101',
 'David G. Webbert': '160 Capitol StreetP.O. Box 79Augusta, ME 04332',
 'David Nelson Wood Esq': '120 Main StreetSuite 110Saco, ME 04072',
 'Dylan R. Boyd': '6 City CenterSuite 301Portland, ME 04101',
 'Gregory LeClerc': '36 Ossipee Trl W.Standish, ME 04084',
 'Hunter J Tzovarras': '88 Hammond StreetBangor, ME 04401',
 'John S. Webb': '16 Middle StSaco, ME 04072',
 'John Simpson': '5 Island View DrCumberland Foreside, ME 04110',
 'Jonathan Steven Handelman Esq': '16 Union StreetBrunswick, ME 04011',
 'Luke Rioux Esq': '75 Pearl St. Suite 400Portland, ME 04101',
 'Mariah America Gleaton': '12 Silver StreetP.O. Box 559Waterville, ME 04903',
 'Meredith G. Schmid': 'PO Box 335York, ME 03909',
 'Michael Stephen Bowser Jr.': '37 Western Ave., Unit #307Kennebunk, ME 04043',
 'Michael Turndorf Esq': '415 Congress StreetSuite 202Portland, ME 04101',
 'Michele D L Kenney': '18 Market Square Suite 5Houlton, ME 04730',
 'Miklos Pongratz Esq': '76 Tandberg Trail (Route 115)Windham, ME 04062',
 'Mr. Richard Lyman Hartley': '15 Columbia Street, Ste. 301Bangor, ME 04401',
 'Neal L Weinstein Esq': '32 Saco AveOld Orchard Beach, ME 04064',
 'Peter J Cyr Esq': '85 Brackett StreetPortland, ME 04102',
 'Richard Regan': '4 Union Park RoadTopsham, ME 04086',
 'Richard Smith Berne': '482 Congress Street Suite 402Portland, ME 04101',
 'Robert Guillory Esq': '241 Main StreetP.O. Box 57Saco, ME 04072',
 'Robert Van Horn': '20 Oak StreetEllsworth, ME 04605',
 'Russell Goldsmith Esq': '647 U.S. Route One#203York, ME 03909',
 'Shelley Carter': '110 Portland StreetFryeburg, ME 04037',
 'Thaddeus Day Esq': '440 Walnut Hill RdNorth Yarmouth, ME 04097',
 'Thomas P. Elias': '28 Long Sands Road, Suite 5York, ME 03909',
 'Timothy Zerillo': '1250 Forest Avenue, Ste 3APortland, ME 04103',
 'Todd H Crawford Jr': '1288 Roosevelt Trl, Ste #3P.O. Box 753Raymond, ME 04071',
 'Walter McKee Esq': '133 State StreetAugusta, ME 04330',
 'Wayne Foote Esq': '344 Mount Hope Ave.Bangor, ME 04402',
 'Will Ashe': '192 Main StreetEllsworth, ME 04605',
 'William T. Bly Esq': '119 Main StreetKennebunk, ME 04043',
 'Zachary J. Smith': 'P.O. Box 1049304 Hancock St. Suite 1KBangor, ME 04401'}
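
One thing to keep in mind with dict(zip(...)): a dictionary keeps only one value per key, so if two lawyers on the page shared a name, the later address would overwrite the earlier one. zip() also stops at the shorter list without complaining. Below is a minimal sketch of both effects, using made-up names and addresses (none of the sample data comes from the site) and a list of (name, address) pairs as one possible alternative.

```python
# Hypothetical sample data: these names and addresses are made up for illustration.
names = ["John Smith", "Jane Doe", "John Smith"]
addresses = [
    "1 Main St, Bangor, ME 04401",
    "2 Oak St, Saco, ME 04072",
    "3 Elm St, York, ME 03909",
]

# zip() silently stops at the shorter list, so check the lengths first.
if len(names) != len(addresses):
    raise ValueError("names and addresses have different lengths")

as_dict = dict(zip(names, addresses))     # duplicate keys: the last value wins
as_records = list(zip(names, addresses))  # every pair is kept, in page order

print(len(as_dict), len(as_records))
for name, address in as_records:
    print(name, "|", address)
```

If names on the page are guaranteed to be unique, the dictionary is fine; the list of pairs is simply the safer default when that is not known.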

CodePudding user response:

Use x.get_text() instead of x.find_all():

for x in address:
    print(x.get_text(strip=True))
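
Note that get_text(strip=True) joins the stripped pieces with no separator, so the street and the city end up run together. If the goal is the two-line layout from the question, one option is to clean each text fragment separately. The sketch below works on the exact strings shown in the question; the helper name clean_address is hypothetical, not something from BeautifulSoup.

```python
# Raw fragments exactly as printed by x.find_all(text=True) in the question.
raw_parts = [
    '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t88 Hammond Street',
    '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBangor,\t\t\t\t\tME 04401\t\t\t\t\t\t    ',
]


def clean_address(parts):
    """Collapse whitespace inside each fragment and drop empty fragments.

    str.split() with no arguments splits on any run of whitespace
    (spaces, tabs, newlines), so joining with a single space normalizes it.
    """
    lines = (" ".join(part.split()) for part in parts)
    return [line for line in lines if line]


address_lines = clean_address(raw_parts)
print("\n".join(address_lines))
```

With BeautifulSoup itself, get_text() also accepts a separator argument, for example x.get_text(separator="\n", strip=True), which keeps the fragments on separate lines; the tabs inside a fragment would still need collapsing as above.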

Full working code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# names come from the title attribute of each avatar link
n = [x.get('title').strip() for x in soup.select('a.lawyer-avatar')]

# addresses: strip each text fragment and remove the embedded tabs
ad = [x.get_text(strip=True).replace('\t', '').strip() for x in soup.find_all("span", class_="-address -hide-landscape-tablet")]

# a flat list of column names gives ordinary columns rather than a MultiIndex
df = pd.DataFrame(data=list(zip(n, ad)), columns=['Lawyer_name', 'address'])

print(df)

Output:

             Lawyer_name                              address
0              William T. Bly Esq                  119 Main StreetKennebunk,ME 04043
1                    John S. Webb                    949 Main StreetSanford,ME 04073
2              William T. Bly Esq                    20 Oak StreetEllsworth,ME 04605
3          Christopher Causey Esq                          16 Middle StSaco,ME 04072
4                 Robert Van Horn                   88 Hammond StreetBangor,ME 04401
5                    John S. Webb       37 Western Ave., Unit #307Kennebunk,ME 04043
6              Hunter J Tzovarras                  4 Union Park RoadTopsham,ME 04086
7      Michael Stephen Bowser Jr.            241 Main StreetP.O. Box 57Saco,ME 04072
8                   Richard Regan            6 City CenterSuite 301Portland,ME 04101
9             Robert Guillory Esq            75 Pearl St. Suite 400Portland,ME 04101
10                  Dylan R. Boyd      160 Capitol StreetP.O. Box 79Augusta,ME 04332
11                 Luke Rioux Esq                 10 Stoney Brook LaneLyman,ME 04002
12               David G. Webbert        15 Columbia Street, Ste. 301Bangor,ME 04401
13                  Amy Fairfield              32 Saco AveOld Orchard Beach,ME 04064
14      Mr. Richard Lyman Hartley         62 Portland Rd., Ste. 44Kennebunk,ME 04043      
15           Neal L Weinstein Esq                647 U.S. Route One#203York,ME 03909      
16                  Albert Hansen      76 Tandberg Trail (Route 115)Windham,ME 04062      
17          Russell Goldsmith Esq        Two Canal PlazaPO Box 4600Portland,ME 04112      
18            Miklos Pongratz Esq           18 Market Square Suite 5Houlton,ME 04730      
19       Bradford Pattershall Esq       5 Island View DrCumberland Foreside,ME 04110      
20             Michele D L Kenney    12 Silver StreetP.O. Box 559Waterville,ME 04903      
21                   John Simpson                 344 Mount Hope Ave.Bangor,ME 04402      
22         Mariah America Gleaton                  192 Main StreetEllsworth,ME 04605      
23                Wayne Foote Esq                85 Brackett StreetPortland,ME 04102      
24                      Will Ashe                  16 Union StreetBrunswick,ME 04011      
25                Peter J Cyr Esq     482 Congress Street Suite 402Portland,ME 04101      
26  Jonathan Steven Handelman Esq                            PO Box 335York,ME 03909      
27            Richard Smith Berne                 36 Ossipee Trl W.Standish,ME 04084      
28             Meredith G. Schmid             75 Pearl St.Suite 216Portland,ME 04101      
29                Gregory LeClerc           28 Long Sands Road, Suite 5York,ME 03909      
30                   Cory McKenna                      20 Mechanic StCamden,ME 04843      
31                Thomas P. Elias  P.O. Box 1049304 Hancock St. Suite 1KBangor,ME...      
32           Christopher  MacLean        1250 Forest Avenue, Ste 3APortland,ME 04103      
33               Zachary J. Smith      415 Congress StreetSuite 202Portland,ME 04101      
34                 Stephen Sweatt      919 Ridge RoadP.O. BOX 119Bowdoinham,ME 04008      
35           Michael Turndorf Esq        1250 Forest Avenue, Ste 3APortland,ME 04103      
36     Andrews Bruce Campbell Esq                   133 State StreetAugusta,ME 04330      
37                Timothy Zerillo               110 Portland StreetFryeburg,ME 04037      
38               Walter McKee Esq          440 Walnut Hill RdNorth Yarmouth,ME 04097      
39                 Shelley Carter                  70 State StreetEllsworth,ME 04605      
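
In the two outputs above, the same name appears next to different addresses (Hunter J Tzovarras, for example), which suggests that two separately collected lists do not always line up. A more defensive approach is to loop over the element that wraps one listing and read the name and the address from inside it, so a missing field cannot shift the pairing. The sketch below runs on inline sample HTML: the wrapper class "lawyer-card", the second sample entry, and the br between the address lines are assumptions for illustration, so inspect the real page to find the actual wrapper element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "lawyer-card" is an assumed wrapper class, and the
# second card (no address) is made up to show what happens to a missing field.
SAMPLE_HTML = """
<div class="lawyer-card">
  <a class="url main-profile-link" href="#">Hunter J Tzovarras</a>
  <span class="-address -hide-landscape-tablet">
      88 Hammond Street<br>
      Bangor,\t\t\tME 04401
  </span>
</div>
<div class="lawyer-card">
  <a class="url main-profile-link" href="#">Example Lawyer</a>
</div>
"""


def parse_listings(html):
    """Return one (name, address) pair per listing wrapper."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for card in soup.find_all("div", class_="lawyer-card"):
        name_tag = card.find("a", class_="main-profile-link")
        addr_tag = card.find("span", class_="-address")
        name = name_tag.get_text(strip=True) if name_tag else ""
        # stripped_strings yields each non-empty text fragment, already stripped;
        # split()/join collapses the tabs left inside a fragment.
        lines = [" ".join(s.split()) for s in addr_tag.stripped_strings] if addr_tag else []
        results.append((name, "\n".join(lines)))
    return results


records = parse_listings(SAMPLE_HTML)
for name, address in records:
    print(name)
    print(address or "(no address listed)")
```

Because each pair is built from a single wrapper, a listing without an address yields an empty string instead of borrowing the next lawyer's address.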