I am working on scraping multiple pages of search results from this website into a neatly formatted pandas dataframe.
Here are the steps I plan to follow to finish this task:
1.) Identify information from each result I want to pull (3 things)
2.) Pull all the information from the 3 things into separate lists
3.) Append items in lists through for loop into pandas dataframe
Here is what I've tried so far:
import requests
import pandas as pd
#!pip install bs4
from bs4 import BeautifulSoup as bs
url = 'https://www.federalregister.gov/documents/search?conditions[publication_date][gte]=08/28/2021&conditions[term]=economy'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(url, headers=headers)
soup = bs(result.text, 'html.parser')
titles = soup.find_all('h5')
authors = soup.find_all('p')
#dates = soup.find_all('')
#append in for loop
data=[]
for i in range(2,22):
    data.append(titles[i].text)
    data.append(authors[i].text)
    #data.append(dates[i].text)
data=pd.DataFrame()
Before I convert data to a pandas dataframe, I can see the results, but the last line essentially erases the results.
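The results disappear because `data=pd.DataFrame()` rebinds `data` to a new, empty dataframe instead of converting the list. Below is a minimal sketch of one way around this. It uses made-up placeholder values rather than scraped ones, collects one row per result, and passes the rows to the constructor:

```python
import pandas as pd

# Placeholder values standing in for the scraped text (not real results).
titles = ["Blah1", "Blah2", "Blah3"]
authors = ["Agency1", "Agency2", "Agency3"]
dates = ["09/23/2020", "08/22/2018", "06/02/2017"]

# Build one row per search result instead of one flat list.
rows = []
for title, author, date in zip(titles, authors, dates):
    rows.append([title, author, date])

# Pass the rows to the constructor; pd.DataFrame() with no arguments is empty.
df = pd.DataFrame(rows, columns=["Title", "Author", "Date"])
print(df)
```

Keeping each result as its own row also avoids interleaving titles and authors in a single flat list.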
Also, I'm not quite sure how to iterate over the multiple search result pages. I found some code that allows you to pick a starting and ending web page to iterate over like this:
URL = ['https://www.federalregister.gov/documents/search?conditions[publication_date][gte]=08/28/2021&conditions[term]=economy&page=2',
'https://www.federalregister.gov/documents/search?conditions[publication_date][gte]=08/28/2021&conditions[term]=economy&page=4']
for url in range(0,2):
    req = requests.get(URL[url])
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('h5')
    print(titles)
The issue I'm having with this approach is that the first page is not formatted the same as all the other pages. Starting on page two, the end of the url reads, "&page=2". Not sure how to account for that.
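One possible way to sidestep the difference is to append the page number to every URL, including the first. This sketch assumes the site treats `&page=1` the same as the URL without a `page` parameter, which is worth confirming in a browser first. `BASE_URL` and `build_page_url` are hypothetical names used only for illustration:

```python
# Search URL without a page number, taken from the question.
BASE_URL = (
    "https://www.federalregister.gov/documents/search"
    "?conditions[publication_date][gte]=08/28/2021"
    "&conditions[term]=economy"
)


def build_page_url(page: int) -> str:
    """Return the search URL for a given 1-based page number."""
    return f"{BASE_URL}&page={page}"


# Pages 1 through 4; adjust the range to cover more or fewer pages.
urls = [build_page_url(page) for page in range(1, 5)]
for url in urls:
    print(url)
```

Generating the URLs from a range avoids maintaining a hand-written list of page links.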
To summarize, the end result I'm looking for is a dataframe that looks something like this:
Title   Author    Date
Blah1   Agency1   09/23/2020
Blah2   Agency2   08/22/2018
Blah3   Agency3   06/02/2017
....
Can someone please help point me in the right direction? Very lost on this one.
CodePudding user response:
I don't think you need to parse all the pages; you can just download the search results as a CSV.
import pandas as pd
import requests
import io
url = 'https://www.federalregister.gov/documents/search?conditions[publication_date][gte]=08/28/2021&conditions[term]=economy'
url += '&format=csv' # <- Download as CSV
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
result = requests.get(url, headers=headers)
df = pd.read_csv(io.StringIO(result.text))
Output:
>>> df
title type ... pdf_url publication_date
0 Corporate Average Fuel Economy Standards for M... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/03/2021
1 Public Hearing for Corporate Average Fuel Econ... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/14/2021
2 Investigation of Urea Ammonium Nitrate Solutio... Notice ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/08/2021
3 Anchorage Regulations; Mississippi River, Mile... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-08... 08/30/2021
4 Call for Nominations To Serve on the National ... Notice ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/08/2021
.. ... ... ... ... ...
112 Endangered and Threatened Wildlife and Plants;... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/07/2021
113 Energy Conservation Program: Test Procedures f... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/01/2021
114 Taking of Marine Mammals Incidental to Commerc... Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/17/2021
115 Partial Approval and Partial Disapproval of Ai... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/24/2021
116 Clean Air Plans; California; San Joaquin Valle... Proposed Rule ... https://www.govinfo.gov/content/pkg/FR-2021-09... 09/01/2021
[117 rows x 8 columns]
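If only some of the eight columns are needed, the dataframe can be trimmed after loading. The sketch below runs on a small made-up CSV string rather than the real download. Only the column names `title`, `type`, and `publication_date` are taken from the output above, and the `MM/DD/YYYY` date format is assumed from the values shown there:

```python
import io

import pandas as pd

# Made-up sample standing in for result.text; the rows are not real results.
sample_csv = io.StringIO(
    "title,type,publication_date\n"
    "Example document A,Notice,09/08/2021\n"
    "Example document B,Rule,09/17/2021\n"
)

df = pd.read_csv(sample_csv)

# Keep only the columns of interest and give them the desired headers.
subset = df[["title", "publication_date"]].rename(
    columns={"title": "Title", "publication_date": "Date"}
)

# Parse the MM/DD/YYYY strings into real dates so they sort correctly.
subset["Date"] = pd.to_datetime(subset["Date"], format="%m/%d/%Y")
print(subset)
```

The same selection should work on the downloaded dataframe, provided the column names match those printed by `df.columns`.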
CodePudding user response:
If I understand your question correctly, here is a working solution. The starting URL and the URL with page number 1 return the same results, so I scrape range(1, 5), meaning 4 pages. You can widen or narrow the range of page numbers at any time. To store the data in CSV format, uncomment the last line.
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = []
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
# Page 1 is the same as the starting URL, so every page can use &page=N
for page in range(1, 5):
    url = 'https://www.federalregister.gov/documents/search?conditions[publication_date][gte]=08/28/2021&conditions[term]=economy&page={page}'.format(page=page)
    print(url)
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    # Each search result sits in its own div.document-wrapper
    tags = soup.find_all('div', class_='document-wrapper')
    for pro in tags:
        title = pro.select_one('h5 a').get_text(strip=True)
        author = pro.select_one('p a:nth-child(1)').get_text(strip=True)
        date = pro.select_one('p a:nth-child(2)').get_text(strip=True)
        data.append([title, author, date])
# Build the dataframe once, after all pages have been collected
cols = ["Title", "Author", "Date"]
df = pd.DataFrame(data, columns=cols)
print(df)
#df.to_csv("data_info.csv", index = False)
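One caveat with chaining `.get_text()` directly onto `select_one()` is that `select_one()` returns `None` when nothing matches, so a result that lacks one of the links would raise an `AttributeError`. Below is a hedged sketch of a more defensive variant. It runs on made-up HTML that only mimics the structure the selectors above assume, and `text_or_none` is a hypothetical helper:

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking the assumed page structure; the second result
# deliberately lacks a date link.
html = """
<div class="document-wrapper">
  <h5><a href="#">Example title A</a></h5>
  <p><a href="#">Example Agency</a><a href="#">09/08/2021</a></p>
</div>
<div class="document-wrapper">
  <h5><a href="#">Example title B</a></h5>
  <p><a href="#">Another Agency</a></p>
</div>
"""


def text_or_none(tag, selector):
    """Return stripped text for the first match, or None if nothing matches."""
    found = tag.select_one(selector)
    return found.get_text(strip=True) if found is not None else None


soup = BeautifulSoup(html, "html.parser")
rows = []
for wrapper in soup.find_all("div", class_="document-wrapper"):
    rows.append([
        text_or_none(wrapper, "h5 a"),
        text_or_none(wrapper, "p a:nth-child(1)"),
        text_or_none(wrapper, "p a:nth-child(2)"),
    ])
print(rows)
```

Missing fields then show up as empty values in the dataframe instead of stopping the scrape partway through.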