I am working on a project that involves analyzing the text of political emails from this website: https://politicalemails.org/. I am attempting to scrape all the emails using BeautifulSoup and pandas. I have a working chunk right here:
#Import libraries
import numpy as np
import requests
from bs4 import BeautifulSoup
import pandas as pd
#Check if scraping is allowed
url = 'https://politicalemails.org/messages'
page = requests.get(url)
page
#Prepare empty dataframe
df = pd.DataFrame(
    {
        'sender':[''],
        'subject':[''],
        'date':[''],
        'body':['']
    }
)
#Loop through emails and scrape
url_base = 'https://politicalemails.org/messages/'
#email_pages=50
for i in range(2,5):
#for i in range(email_pages):
    url_full = url_base + str(i)
    page = requests.get(url_full)
    soup = BeautifulSoup(page.text,'lxml')
    email = soup.find_all('td',class_='content-box-meta__value')
    message = soup.find_all('div',class_='message-text')
    sender = email[0].text.strip()
    subject = email[1].text.strip()
    date = email[2].text.strip()
    body = message[0].text.strip()
    df = df.append({
        'sender':sender,
        'subject':subject,
        'date':date,
        'body':body
    },ignore_index=True)
df.head()
The above pulls the data I want. However, I want to loop through larger chunks of the emails in this archive. Visiting any of the following links:
print(url_base + str(0))
print(url_base + str(1))
print(url_base + str(100))
results in a '404 Not Found' error. How can I build "skip" logic that checks whether there is anything to scrape from the page and, if not, moves on to the next iteration? If I use the commented-out code with email_pages = 50, I get an error that reads:
IndexError: list index out of range
How should I approach editing my for loop to account for this behavior?
CodePudding user response:
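I'd advise using a match statement (Python's equivalent of switch-case, available from Python 3.10) for situations like these. Put the check inside the loop, right after the requests.get call: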
match page.status_code:
    case 404:
        continue
If your Python version does not support match statements, you could do just the same with an if statement.
if page.status_code == 404:
    continue
continue instructs the loop to move to the next iteration, skipping the rest of the loop body since the request retrieved no resource.
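Putting this together with the loop from the question, here is a minimal sketch of the skip logic. The names scrape_messages and parse_email are hypothetical helpers, not part of requests or BeautifulSoup. It also makes two assumptions that go beyond the 404 check above: any non-200 status is treated as "nothing to scrape", and a page that loads but lacks the expected elements is skipped by catching the IndexError from the question.

```python
def parse_email(html):
    """Turn one message page into a dict (selectors copied from the question)."""
    # Local import so the skip logic below can be used without bs4 installed.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    email = soup.find_all('td', class_='content-box-meta__value')
    message = soup.find_all('div', class_='message-text')
    # Indexing raises IndexError if the page lacks these elements.
    return {
        'sender': email[0].text.strip(),
        'subject': email[1].text.strip(),
        'date': email[2].text.strip(),
        'body': message[0].text.strip(),
    }


def scrape_messages(ids, get, parse,
                    url_base='https://politicalemails.org/messages/'):
    """Collect one row per message id, skipping ids that cannot be scraped.

    get   -- callable like requests.get: takes a URL and returns an object
             with .status_code and .text
    parse -- callable that turns page HTML into a dict and raises
             IndexError when the expected elements are missing
    """
    rows, skipped = [], []
    for i in ids:
        page = get(url_base + str(i))
        # Assumption: anything other than 200 (404 included) has no email.
        if page.status_code != 200:
            skipped.append(i)
            continue
        try:
            rows.append(parse(page.text))
        except IndexError:
            # The page loaded but did not contain the expected fields.
            skipped.append(i)
    return rows, skipped
```

With requests and pandas imported as in the question, this could be called as rows, skipped = scrape_messages(range(50), requests.get, parse_email), followed by df = pd.DataFrame(rows). Building the DataFrame once from a list of dicts also avoids calling df.append inside the loop, which was removed in pandas 2.0.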