Trouble with Beautiful Soup Scraping


I am working on scraping multiple pages of search results from this website into a neatly formatted pandas DataFrame.

I've outlined the steps for how I plan to finish this task.

1.) Identify information from each result I want to pull (3 things)

2.) Pull all the information from the 3 things into separate lists

3.) Append the items from the lists into a pandas DataFrame with a for loop (a rough sketch of what I have in mind for this step is below)
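For step 3, this is roughly the shape I'm picturing, with made-up placeholder lists rather than my actual scraped data:

import pandas as pd

# hypothetical placeholder lists; in the real script these would hold the scraped values
titles = ['Blah1', 'Blah2']
authors = ['Agency1', 'Agency2']
dates = ['09/23/2020', '08/22/2018']

# build the DataFrame from the three parallel lists
df = pd.DataFrame({'Title': titles, 'Author': authors, 'Date': dates})
print(df)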

Here is what I've tried so far:

import requests
import pandas as pd
#!pip install bs4
from bs4 import BeautifulSoup as bs

url = 'https://www.federalregister.gov/documents/search?conditions[publication_date][gte]=08/28/2021&conditions[term]=economy'

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

result = requests.get(url, headers=headers)

soup = bs(result.text, 'html.parser')
titles = soup.find_all('h5')
authors = soup.find_all('p')
#dates = soup.find_all('')

#append in for loop

data=[]

for i in range(2,22):
    data.append(titles[i].text)
    data.append(authors[i].text)
    #data.append(dates[i].text)

data=pd.DataFrame()

Before I convert data to a pandas dataframe, I can see the results, but the last line essentially erases the results.

Also, I'm not quite sure how to iterate over the multiple search result pages. I found some code that allows you to pick a starting and ending web page to iterate over like this:

URL = ['https://www.federalregister.gov/documents/search?conditions[publication_date][gte]=08/28/2021&conditions[term]=economy&page=2',
       'https://www.federalregister.gov/documents/search?conditions[publication_date][gte]=08/28/2021&conditions[term]=economy&page=4']
  
for url in range(0,2):
    req = requests.get(URL[url])
    soup = bs(req.text, 'html.parser')
  
    titles = soup.find_all('h5')

    print(titles)

The issue I'm having with this approach is that the first page is not formatted the same as all the other pages. Starting on page two, the end of the url reads, "&page=2". Not sure how to account for that.

To summarize, the end result I'm looking for is a DataFrame that looks something like this:

Title   Author    Date
Blah1   Agency1   09/23/2020
Blah2   Agency2   08/22/2018
Blah3   Agency3   06/02/2017
...

Can someone please help point me in the right direction? Very lost on this one.

CodePudding user response:

I don't think you need to parse all of the pages; you can just download the CSV directly.

import pandas as pd
import requests
import io

url = 'https://www.federalregister.gov/documents/search?conditions[publication_date][gte]=08/28/2021&conditions[term]=economy'
url += '&format=csv'  # <- append format parameter to download as CSV

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

result = requests.get(url, headers=headers)

df = pd.read_csv(io.StringIO(result.text))

Output:

>>> df
                                                 title           type  ...                                            pdf_url publication_date
0    Corporate Average Fuel Economy Standards for M...  Proposed Rule  ...  https://www.govinfo.gov/content/pkg/FR-2021-09...       09/03/2021
1    Public Hearing for Corporate Average Fuel Econ...  Proposed Rule  ...  https://www.govinfo.gov/content/pkg/FR-2021-09...       09/14/2021
2    Investigation of Urea Ammonium Nitrate Solutio...         Notice  ...  https://www.govinfo.gov/content/pkg/FR-2021-09...       09/08/2021
3    Anchorage Regulations; Mississippi River, Mile...  Proposed Rule  ...  https://www.govinfo.gov/content/pkg/FR-2021-08...       08/30/2021
4    Call for Nominations To Serve on the National ...         Notice  ...  https://www.govinfo.gov/content/pkg/FR-2021-09...       09/08/2021
..                                                 ...            ...  ...                                                ...              ...
112  Endangered and Threatened Wildlife and Plants;...  Proposed Rule  ...  https://www.govinfo.gov/content/pkg/FR-2021-09...       09/07/2021
113  Energy Conservation Program: Test Procedures f...  Proposed Rule  ...  https://www.govinfo.gov/content/pkg/FR-2021-09...       09/01/2021
114  Taking of Marine Mammals Incidental to Commerc...           Rule  ...  https://www.govinfo.gov/content/pkg/FR-2021-09...       09/17/2021
115  Partial Approval and Partial Disapproval of Ai...  Proposed Rule  ...  https://www.govinfo.gov/content/pkg/FR-2021-09...       09/24/2021
116  Clean Air Plans; California; San Joaquin Valle...  Proposed Rule  ...  https://www.govinfo.gov/content/pkg/FR-2021-09...       09/01/2021

[117 rows x 8 columns]
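If you only want the three fields from your example, you can then keep just those columns. Note that the exact column names come from the CSV header; I'm assuming the agency column is called agency_names, so check df.columns first.

# keep only the columns of interest (agency_names is an assumed column name -- verify with df.columns)
df = df[['title', 'agency_names', 'publication_date']]
df.columns = ['Title', 'Author', 'Date']
print(df.head())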

CodePudding user response:

If I understand your question, then here is a working solution. The starting URL and the URL with page number 1 are the same thing, and I scrape the page range(1, 5), meaning 4 pages. You can increase or decrease the range of page numbers at any time. To store the data in CSV format, uncomment the last line.

Code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
for page in range(1, 5):
    url = 'https://www.federalregister.gov/documents/search?conditions[publication_date][gte]=08/28/2021&conditions[term]=economy&page={page}'.format(page=page)
    print(url)
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    tags = soup.find_all('div', class_ ='document-wrapper')
    
    for pro in tags:
        title = pro.select_one('h5 a').get_text(strip = True)
        author = pro.select_one('p a:nth-child(1)').get_text(strip = True)
        date = pro.select_one('p a:nth-child(2)').get_text(strip = True)
        data.append([title,author,date])

cols = ["Title", "Author","Date"]

df = pd.DataFrame(data,columns=cols)

print(df)

#df.to_csv("data_info.csv", index = False)
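If you don't know in advance how many result pages there are, one possible variation (just an untested sketch that reuses the imports, headers, data list, and selectors from the code above) is to keep incrementing the page number until a page comes back with no result blocks:

# untested sketch: loop until a page returns no 'document-wrapper' divs
page = 1
while True:
    url = 'https://www.federalregister.gov/documents/search?conditions[publication_date][gte]=08/28/2021&conditions[term]=economy&page={page}'.format(page=page)
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    tags = soup.find_all('div', class_='document-wrapper')
    if not tags:
        # no more results on this page -> stop paging
        break
    for pro in tags:
        title = pro.select_one('h5 a').get_text(strip=True)
        author = pro.select_one('p a:nth-child(1)').get_text(strip=True)
        date = pro.select_one('p a:nth-child(2)').get_text(strip=True)
        data.append([title, author, date])
    page += 1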