I have this very simple code to web-scrape, but it takes too long. Before I added a timeout, I waited for more than 30 minutes and it still did not return anything. Now I have set a timeout of 300 seconds, and it always fails with the error 'read operation timed out'. I am using Jupyter Notebook and tried Edge, Chrome, and Opera as my browser, but nothing worked. I can assure you that my internet is fine, because when I open the URL in a web browser, it loads.
import bs4
import pandas as pd
from bs4 import BeautifulSoup as bssoup
import urllib.request
import re

hotel_reviewnames2 = []
hotel_review2 = []
for i in range(7, 9):
    # Each page of reviews is offset by 5, so page i starts at review i * 5.
    urlp1 = 'https://www.tripadvisor.com.ph/Hotel_Review-g298459-d3914035-Reviews-or'
    urlp2 = '-Park_Inn_by_Radisson_Davao-Davao_City_Davao_del_Sur_Province_Mindanao.html#REVIEWS/robots.txt'
    realurl = urlp1 + str(i * 5) + urlp2
    print(realurl)
    hp6 = urllib.request.urlopen(realurl, timeout=300)
    soup6 = bssoup(hp6, 'html.parser')
    # Reviewer names and review bodies, selected by TripAdvisor's CSS classes.
    hotelreviewsDavaonames3 = soup6.find_all('a', class_='ui_header_link bPvDb')
    hotelreviewsDavao3 = soup6.find_all('div', class_='pIRBV _T')
    for x in hotelreviewsDavaonames3:
        hotel_reviewnames2.append(x.text.split())
    for y in hotelreviewsDavao3[1::2]:
        hotel_review2.append(y.text.split('\n'))

df4 = pd.DataFrame({
    'Reviewer_Name': hotel_reviewnames2,
    'Reviews': hotel_review2
})
print(df4)
CodePudding user response:
What you are (probably) doing is forbidden by the TripAdvisor "Terms and Conditions".
If the URL works from your web browser but not from your scraper, they are probably using unspecified "technical means" to distinguish the scraper's requests from regular (legitimate) web requests, and black-holing yours.
My advice would be to stop.
If you succeed in getting around this roadblock, they are likely to use other means to block you, possibly escalating to "Cease and Desist" letters and lawsuits. (Or they could play really nasty ...)
CodePudding user response:
As Stephen C said, this could be problematic. Scraping without consent is usually ill-advised.
That being said, I believe that scraping is always going to happen and there's something fun about beating the people trying to stop you.
I'm not entirely sure what the requirements for the course you mentioned are, but perhaps look into Selenium. It's a Python package with a nice tutorial in its documentation. It drives an actual browser that pseudo-simulates a real user, cookies and all, and might be able to trick the website.
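A minimal sketch of that approach, assuming Selenium 4.6+ (which fetches a matching chromedriver automatically) and a local Chrome install; the URL and CSS class names are copied from the question and may have changed on the live site:

from selenium import webdriver
from bs4 import BeautifulSoup

url = ('https://www.tripadvisor.com.ph/Hotel_Review-g298459-d3914035-Reviews-or35'
       '-Park_Inn_by_Radisson_Davao-Davao_City_Davao_del_Sur_Province_Mindanao.html')

driver = webdriver.Chrome()          # a real browser, so cookies and JavaScript work
try:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    names = soup.find_all('a', class_='ui_header_link bPvDb')  # class from the question
    print([a.get_text(strip=True) for a in names])
finally:
    driver.quit()                    # always release the browser process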
Alternatively, you can see if you can add certain headers to the urllib.request call to simulate an actual browser. I know the requests library adds these by default, but I'm not sure about urllib.request. Something like:

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    "Dnt": "1",
    "Host": [INSERT SPECIFIC URLHOST],
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
}
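If urllib.request is required, the headers can be attached with urllib.request.Request. A minimal sketch reusing the User-Agent above (I left out Accept-Encoding because, unlike requests, urllib does not transparently decompress gzip):

import urllib.request

url = ('https://www.tripadvisor.com.ph/Hotel_Review-g298459-d3914035-Reviews-or35'
       '-Park_Inn_by_Radisson_Davao-Davao_City_Davao_del_Sur_Province_Mindanao.html')
headers = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/83.0.4103.97 Safari/537.36'),
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
req = urllib.request.Request(url, headers=headers)   # request with browser-like headers
with urllib.request.urlopen(req, timeout=30) as resp:
    print(resp.status, len(resp.read()))             # quick sanity check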
Source: I interned at a company that scraped a lot of public data and dealt with people who tried to slow us down. One of my coworkers used Selenium to beat captchas too!