So I'm scraping TripAdvisor to get some information and here's one of the lists that I have:
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>ONE OF THE BEST !</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r832190054-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>excellent stay</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wow</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r830900803-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Really nice hotel</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r829471539-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great front desk staff</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Excellent walking tour by Victoria!</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r819463197-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>The best of the best in Lisbon</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r833862442-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great staffs, great hotel and great tours</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r833861014-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wonderful Experience and Best Hotel in Lisbon</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r833717753-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great hotel. Great Staff. Wonderful walking tour with David.</span></span></a></div>
I basically want to get rid of everything but the links (e.g. /ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html).
What's the easiest way to do this in Python?
CodePudding user response:
Perfect job for BeautifulSoup:
import re
from bs4 import BeautifulSoup
html = """
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>ONE OF THE BEST !</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r832190054-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>excellent stay</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wow</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r830900803-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Really nice hotel</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r829471539-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great front desk staff</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Excellent walking tour by Victoria!</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r819463197-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>The best of the best in Lisbon</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r833862442-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great staffs, great hotel and great tours</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r833861014-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wonderful Experience and Best Hotel in Lisbon</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r833717753-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great hotel. Great Staff. Wonderful walking tour with David.</span></span></a></div>
"""
soup = BeautifulSoup(html, "html.parser")
links = []
# Match any review link, whichever hotel it belongs to.
pattern = re.compile(r"/ShowUserReviews-.*\.html")
for a in soup.find_all("a"):
    href = a.get("href", "")
    if pattern.match(href):
        links.append(href)
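The loop above can also be folded into the find_all call, since BeautifulSoup accepts a compiled regular expression as an attribute filter. This is only a sketch: sample_html is a shortened copy of the question's markup, the third link is made up to show that non-review links are skipped, and the names sample_soup, review_href and review_links are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Two rows shortened from the question's markup, plus one made-up
# non-review link to show that it is filtered out.
sample_html = """
<div data-test-target="review-title"><a href="/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span>ONE OF THE BEST !</span></a></div>
<div data-test-target="review-title"><a href="/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span>Excellent walking tour by Victoria!</span></a></div>
<a href="/some-other-page.html">hypothetical non-review link</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all keeps only <a> tags whose href matches the pattern.
review_href = re.compile(r"^/ShowUserReviews-")
review_links = [a["href"] for a in sample_soup.find_all("a", href=review_href)]

print(review_links)
```

Anchoring the pattern on the /ShowUserReviews- prefix keeps it independent of the hotel name, so links for both hotels in the list are returned.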
CodePudding user response:
Simply use a CSS selector to pick out every <a> whose href starts with /ShowUserReviews:
[a['href'] for a in soup.select('a[href^="/ShowUserReviews"]')]
or prepend the base URL https://www.tripadvisor.com to get absolute links:
['https://www.tripadvisor.com' + a['href'] for a in soup.select('a[href^="/ShowUserReviews"]')]
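For the base-URL variant, urllib.parse.urljoin from the standard library is one way to build the absolute links without worrying about a missing or doubled slash. A minimal sketch, assuming the hrefs have already been extracted; BASE_URL, hrefs and absolute_urls are illustrative names, and the two hrefs are copied from the question.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tripadvisor.com/"

# Relative links as they appear in the scraped markup.
hrefs = [
    "/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html",
    "/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html",
]

# urljoin resolves each root-relative path against the site root.
absolute_urls = [urljoin(BASE_URL, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```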
Example
html='''
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>ONE OF THE BEST !</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r832190054-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>excellent stay</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wow</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r830900803-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Really nice hotel</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r829471539-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great front desk staff</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Excellent walking tour by Victoria!</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r819463197-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>The best of the best in Lisbon</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r833862442-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great staffs, great hotel and great tours</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r833861014-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wonderful Experience and Best Hotel in Lisbon</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d12659702-r833717753-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great hotel. Great Staff. Wonderful walking tour with David.</span></span></a></div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
urls = [a['href'] for a in soup.select('a[href^="/ShowUserReviews"]')]
Output urls
['/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
'/ShowUserReviews-g189158-d229324-r832190054-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
'/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
'/ShowUserReviews-g189158-d229324-r830900803-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
'/ShowUserReviews-g189158-d229324-r829471539-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
'/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
'/ShowUserReviews-g189158-d12659702-r819463197-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
'/ShowUserReviews-g189158-d12659702-r833862442-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
'/ShowUserReviews-g189158-d12659702-r833861014-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
'/ShowUserReviews-g189158-d12659702-r833717753-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html']
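If the full page contains other /ShowUserReviews links outside the review titles, the selector could also be scoped to the wrapping div through its data-test-target attribute, which appears in the question's markup. This is a sketch under that assumption; the second div and its link are made up for contrast, and sample_html, sample_soup and title_links are illustrative names.

```python
from bs4 import BeautifulSoup

# First row is copied from the question; the second div is a made-up
# example of a review link that is NOT inside a review-title block.
sample_html = '''
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wow</span></span></a></div>
<div data-test-target="something-else"><a href="/ShowUserReviews-hypothetical.html">elsewhere on the page</a></div>
'''

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Only <a> tags that are direct children of a review-title div are selected.
title_links = [
    a["href"]
    for a in sample_soup.select('div[data-test-target="review-title"] > a[href]')
]

print(title_links)
```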