After some tutorials, I tried to scrape this website: https://www.usaopps.com/government_contr


This was an attempt to first get all the links from the titles on the first page. It worked, but I want to write the links to a .txt file, and to get them for all available pages too.

from bs4 import BeautifulSoup
import requests

URL = "https://www.usaopps.com/government_contractors/naics-111110-Soybean-Farming.htm"
fixed_url = "https://www.usaopps.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="contractor-list")

links = []

contractor_elements = results.find_all("div", class_="lr-title")
for contractor_element in contractor_elements:
    # collect every title link instead of overwriting `links` on each pass
    for link in contractor_element.find_all("a"):
        link_url = link["href"]
        links.append(link_url)
        print(f"full link: {fixed_url}{link_url}\n")

After that, I got the contact person details and fax number with this code:

from bs4 import BeautifulSoup
import requests

url = "https://www.usaopps.com/government_contractors/contractor-5922555-BSL-GLOBAL-WATER-SOLUTION.htm"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
results_info = soup.find(id="box-sideinfo")
info_elements = results_info.find_all("div", class_="info-gen-box clearfix")
fax = soup.select("#box-sideinfo > div > dl > dd:nth-child(14)")
contact_person = soup.select("#box-sideinfo > div > dl > dd:nth-child(16)")
print(contact_person)
print(fax)
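
Note that those nth-child selectors are position-based and will break as soon as a detail page is missing a field. A more robust sketch looks each value up by its dt label instead (the exact label strings "Contact Person:" and "Fax:" are assumptions about the page markup):

def field(soup, label):
    # Find the <dt> whose text equals `label` and return its <dd> value;
    # returns None if the page lacks that field. The label strings used
    # below are assumed, not verified against the live page.
    dt = soup.find("dt", string=label)
    return dt.find_next_sibling("dd").get_text(strip=True) if dt else None

print(field(soup, "Contact Person:"))
print(field(soup, "Fax:"))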

I wanted the new URL to be each of the links from my first code, and to have both codes work together...

CodePudding user response:

This is one way of obtaining that info, and displaying it in a meaningful way:

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

big_list = []
for i in range(1, 2):

    url = f"https://www.usaopps.com/government_contractors/naics-111110-Soybean-Farming.{i}.htm"
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    for x in soup.select('div.list-one')[:3]:
        det_url = 'https://www.usaopps.com' + x.select_one('a').get('href')
#         print(det_url)
        req = requests.get(det_url)
        det_soup = BeautifulSoup(req.text, 'html.parser')
        info_box = det_soup.select_one('div.info-gen-box')
        c_name = info_box.find('dt', text='Company Name:').find_next_sibling('dd').text
        c_address = info_box.find('dt', text='Address:').find_next_sibling('dd').text
        c_phone = info_box.find('dt', text='Phone:').find_next_sibling('dd').text
#         print(c_name, c_address, c_phone)
        big_list.append((c_name, c_address, c_phone))

df = pd.DataFrame(big_list, columns = ['Company', 'Address', 'Phone'])
print(df)

This will print in terminal:

                           Company                         Address         Phone
0  BSL GLOBAL WATER SOLUTIONS, INC                  5020 Campus Dr  949-296-7666
1                  JONES 3 CO. LLC       4133 Fishcreek Rd Apt 401  360-279-8638
2           Banneker Ventures, LLC  5 Choke Cherry Road, Suite 378  301-990-4980

There are 83 pages with companies, so this will take some time.
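
To actually crawl all of them, extend the page counter from `range(1, 2)` to `range(1, 84)` and drop the `[:3]` slice. Once the DataFrame is built, it can be written straight to a text file; a sketch, where contractors.txt is just an example name:

# Sketch: persist the finished table as tab-separated text
# (assumes the `df` built above; contractors.txt is an example)
df.to_csv("contractors.txt", sep="\t", index=False)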

Requests docs: https://requests.readthedocs.io/en/latest/
BeautifulSoup docs: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
And of course, pandas docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
