I'm trying to write code that gets reviewers' names and reviews from Booking.com.
I was able to get all the necessary URLs and isolate the reviewer's name and comments from the HTML, but I'm struggling to write a while loop that moves on to the next review.
The while loop should take the reviewer's name, append it to the list, move to the next name, append it, and so forth. I also need to do the same for the comments.
When I run the code nothing happens, and I'm not sure where my issue is.
#Loop parameters
##HTMLs
#Booking.com URL
search_url[0] = 'https://www.booking.com/reviews/us/hotel/shore-cliff.es.html?label=gen173nr-1DEgdyZXZpZXdzKIICOOgHSDNYBGiTAogBAZgBCrgBF8gBDNgBA-gBAYgCAagCA7gC5bPZkQbAAgHSAiQzMTc3NTA4OS00OGRkLTQ5ZjYtYjBhNi1kOWEzYzZhN2QwOWXYAgTgAgE;sid=3e3ae22b47e3df3ac2590eb19d37f888;customer_type=total;hp_nav=0;old_page=0;order=featuredreviews;page=1;r_lang=all;rows=75&'
link = search_urls[0] #Just the first one to try
url = link
html = urllib.request.urlopen(url).read().decode('utf-8') #loading each search page
#Main HTML of first hotel
index=html.find('')
review_list_html = html[index:]
##Lists:
hotels=[]
reviewer_name=[]
review_comment=[]
#Creating counter variable
counter=0
reviewercount =0
#Main HTML of first hotel
index=html.find('')
review_list_html = html[index:]
reviewer_html = review_list_html[review_list_html.find('reviewer_name'):]
review_html = review_list_html[review_list_html.find('>'):]
#Loop to get reviewer
while review_list_html.find('reviewer_name'):
    #Get reviewer's name
    #Start of reviewer's name
    start = reviewer_html.find('<span itemprop="name">') + 22 #To ignore <span itemprop="name"> and jump right to the name
    start
    #End of reviewer's name
    end = reviewer_html.find('</span>')
    #Isolating reviewer's name
    reviewer_html = reviewer_html[start:end]
    #Adding reviewer to list
    reviewer_name.append(reviewer_html)
CodePudding user response:
Your issue is that every subsequent index lookup needs to start from the previous index; otherwise you create an infinite loop. It's generally more common to use an HTML parser such as Beautiful Soup, but it's absolutely possible to parse this page with the method you're trying to use.
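For comparison, here is a minimal sketch of the parser-based route. It uses the standard library's html.parser rather than Beautiful Soup, so it needs no extra install. The class name ReviewParser and the sample markup are assumptions made for illustration; the real Booking.com page may be structured differently.

```python
from html.parser import HTMLParser


class ReviewParser(HTMLParser):
    """Collects the text of <span itemprop="name"> and <span itemprop="reviewBody">."""

    def __init__(self):
        super().__init__()
        self.names = []
        self.bodies = []
        self._target = None  # list that receives the text of the current span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        itemprop = dict(attrs).get("itemprop")
        if itemprop == "name":
            self._target = self.names
        elif itemprop == "reviewBody":
            self._target = self.bodies

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about
        if self._target is not None:
            self._target.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._target = None


# Hypothetical sample markup, not copied from the real page
sample = (
    '<p class="reviewer_name"><span itemprop="name">Ana</span></p>'
    '<span itemprop="reviewBody">Todo excelente</span>'
)
parser = ReviewParser()
parser.feed(sample)
parser.close()
print(parser.names, parser.bodies)
```

This sketch does not handle spans nested inside the target spans; a real page may need that.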
We can use "reviewer_name" as the main index for every review block. Starting from this index, we get the indexes of "name" and </span>. The text between those indexes is the reviewer's name. To parse the review body, we find all indexes of "reviewBody" that come before the index of the next review block.
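The two building blocks here are the optional start and end arguments of str.find. The sketch below shows them in isolation on a made-up string; the helper name find_all and the sample text are assumptions for illustration only. Note that str.find returns -1 when nothing is found, and -1 is truthy, which is why a bare `while text.find(...)` condition does not stop on its own.

```python
# Hypothetical sample text standing in for the downloaded HTML
text = (
    'A "reviewer_name" x "reviewBody">one</span> '
    '"reviewer_name" y "reviewBody">two</span>'
)


def find_all(haystack, needle, start=0, end=None):
    """Return every index of needle within haystack[start:end]."""
    if end is None:
        end = len(haystack)
    positions = []
    pos = haystack.find(needle, start, end)
    while pos >= 0:  # find() returns -1 once there are no more matches
        positions.append(pos)
        # Continue the search just past the previous hit
        pos = haystack.find(needle, pos + 1, end)
    return positions


# Start index of every review block
blocks = find_all(text, '"reviewer_name"')
# Review bodies that belong to the first block only
first_block_bodies = find_all(text, '"reviewBody"', blocks[0], blocks[1])
print(blocks, first_block_bodies)
```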
Full code:
from urllib.request import urlopen

link = "https://www.booking.com/reviews/us/hotel/shore-cliff.es.html"
with urlopen(link) as request:
    response = request.read().decode()

reviews = []
name_pos = response.find('"reviewer_name"')  # find first review
while name_pos >= 0:
    name = ""
    review_blocks = []

    # The reviewer's name sits between '"name">' and the next '</span>'
    start_pos = response.find('"name"', name_pos)
    end_pos = response.find("</span>", start_pos)
    if end_pos > start_pos >= 0:
        name = response[start_pos + 7: end_pos]  # 7 == len('"name">')

    prev_name_pos = name_pos
    name_pos = response.find('"reviewer_name"', name_pos + 1)  # get next review

    # Collect every review body located before the next review block
    start_pos = response.find('"reviewBody"', prev_name_pos, name_pos)
    while start_pos >= 0:
        end_pos = response.find("</span>", start_pos)
        if end_pos > start_pos >= 0:
            review_blocks.append(response[start_pos + 13: end_pos])  # 13 == len('"reviewBody">')
        start_pos = response.find('"reviewBody"', start_pos + 1, name_pos)

    reviews.append((name, "\n".join(review_blocks)))
Content of reviews:
[
('Adriana',
'Nada para criticar.\n'
'Impecable lugar, habitación con vistas hermosas cualquiera sea. Camas '
'confortables, pequeña cocina completa, todo impecable.\n'
'La atención en recepción excelente, no se pierdan las cookies que convidan '
'por la tarde allí. El desayuno variado y con unos tamales exquisitos! Cerca '
'de todo.'),
('Ana', 'Todo excelente'),
('Lara',
'simplemente un poco de ruido en el tercer piso pero solo fue un poco antes '
'de las 10:00pm\n'
'realmente todo estaba excelente, ese gran detalle de el desayuno se les '
'agradece mucho.'),
('Rodrigo',
'Todo me gustó solo lo único que me hubiera gustado que también tuvieran es '
'unas chimeneas.\n'
'El hotel tiene una hermosa vista y se puede caminar y disfrutar por toda la '
'orilla de la playa hasta llegar al muelle y mas lejos si uno quiere.'),
('May', 'Me encanto q estaba abierta la piscina