I am trying to write a web scraping function that does a few things:
- Determines the number of URLs to scrape based on a list of URLs
- Creates a separate file for each URL
- Scrapes the TEXT from each URL
- Inserts the result of each text scrape into the designated file that was just created
Here is the current code:
#this is the array of URLs (kept in websites.py)
urls = ['https://calevip.org/incentive-project/northern-california',
        'https://www.slocleanair.org/community/grants/altfuel.php',
        'https://www.mcecleanenergy.org/ev-charging/',
        'https://www.peninsulacleanenergy.com/ev-charging-incentives/',
        'https://www.irs.gov/businesses/plug-in-electric-vehicle-credit-irc-30-and-irc-30d',
        'https://afdc.energy.gov/laws/12309',
        'https://cleanvehiclerebate.org/eng/fleet',
        'https://calevip.org/incentive-project/san-joaquin-valley']

import requests
from bs4 import BeautifulSoup
import sys
from websites import urls

def scrape():
    for x in range(len(urls)):
        f = open("test" + str(x) + ".txt", 'w')
        for url in urls:
            page = requests.get(url)
            #this line creates a Beautiful Soup object that takes page.content as input
            soup = BeautifulSoup(page.content, "html.parser")
            results = soup.prettify().encode('cp1252', errors='ignore')
            #we need a command that enters the results into the file we just created
            f.write(str(results))
So far, I am able to get the function to perform steps 1 & 2. The problem is that the text scrape from the first website is being placed into all 8 of the .txt files, instead of the scrape of the first website going into the first file, the scrape of the second website into the second file, the scrape of the third website into the third file, etc.
How do I fix this? I feel like I am close, but my second FOR loop isn't written correctly.
CodePudding user response:
Try doing it this way:
import requests
from bs4 import BeautifulSoup as BS

urls = ['https://calevip.org/incentive-project/northern-california',
        'https://www.slocleanair.org/community/grants/altfuel.php',
        'https://www.mcecleanenergy.org/ev-charging/',
        'https://www.peninsulacleanenergy.com/ev-charging-incentives/',
        'https://www.irs.gov/businesses/plug-in-electric-vehicle-credit-irc-30-and-irc-30d',
        'https://afdc.energy.gov/laws/12309',
        'https://cleanvehiclerebate.org/eng/fleet',
        'https://calevip.org/incentive-project/san-joaquin-valley']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def scrape():
    with requests.Session() as session:
        i = 1
        for url in urls:
            try:
                page = session.get(url, headers=headers)
                page.raise_for_status()
                # open a fresh file for this URL, then write this page's scrape into it
                with open(f'test{i}.txt', 'w') as f:
                    f.write(BS(page.text, 'lxml').prettify())
                i += 1  # advance the counter so the next URL gets its own file
            except Exception as e:
                print(f'Exception while processing {url} -> {e}')

if __name__ == '__main__':
    scrape()
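Note that both versions write the prettified HTML, while the stated goal was to scrape the TEXT of each page. Below is a minimal sketch of the same single loop that uses enumerate() for the file counter (so there is no manual increment to forget) and BeautifulSoup's get_text() to keep only the visible text. The scrape_text name, the timeout value, and the UTF-8 encoding are my own choices here, not taken from the code above:

import requests
from bs4 import BeautifulSoup
from websites import urls  # the same list shown above

def scrape_text():
    # enumerate() pairs each URL with its index, replacing the manual counter
    for i, url in enumerate(urls):
        try:
            page = requests.get(url, timeout=30)  # timeout is an assumption; tune as needed
            page.raise_for_status()
            soup = BeautifulSoup(page.content, 'html.parser')
            # get_text() strips the markup and returns only the page's text
            text = soup.get_text(separator='\n', strip=True)
            with open(f'test{i}.txt', 'w', encoding='utf-8') as f:
                f.write(text)
        except requests.RequestException as e:
            print(f'Failed to scrape {url}: {e}')

if __name__ == '__main__':
    scrape_text()

Opening the files with an explicit UTF-8 encoding also avoids the cp1252 encode workaround from the question.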