How can I extract HTML links from this list in Python?

Time:04-11

So I'm scraping TripAdvisor to get some information and here's one of the lists that I have:

<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>ONE OF THE BEST !</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r832190054-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>excellent stay</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wow</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r830900803-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Really nice hotel</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r829471539-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great front desk staff</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Excellent walking tour by Victoria!</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r819463197-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>The best of the best in Lisbon</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r833862442-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great staffs, great hotel and great tours</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r833861014-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wonderful Experience and Best Hotel in Lisbon</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r833717753-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great hotel.  Great Staff.  Wonderful walking tour with David.</span></span></a></div>

I basically want to get rid of everything but the links (e.g. /ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html).

What's the easiest way to do this in Python?

Here's a screenshot of the code from one of the review pages, if it helps: (TripAdvisor code screenshot)

CodePudding user response:

Perfect job for BeautifulSoup:

import re
from bs4 import BeautifulSoup

html = """
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>ONE OF THE BEST !</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r832190054-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>excellent stay</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wow</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r830900803-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Really nice hotel</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r829471539-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great front desk staff</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Excellent walking tour by Victoria!</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r819463197-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>The best of the best in Lisbon</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r833862442-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great staffs, great hotel and great tours</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r833861014-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wonderful Experience and Best Hotel in Lisbon</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r833717753-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great hotel.  Great Staff.  Wonderful walking tour with David.</span></span></a></div>
"""
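# Name the parser explicitly so the result doesn't depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")

links = []
# Match any review link, not only the Sheraton ones; the dot is escaped so it matches literally
pattern = re.compile(r"/ShowUserReviews-.*\.html$")
for a in soup.find_all("a"):
    href = a.get("href", "")
    if pattern.match(href):
        links.append(href)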

CodePudding user response:

Simply use a CSS selector to directly select every <a> whose href starts with /ShowUserReviews:

[a['href'] for a in soup.select('a[href^="/ShowUserReviews"]')]

or concatenate with the base URL https://www.tripadvisor.com (no trailing slash, since each href already starts with one):

['https://www.tripadvisor.com' + a['href'] for a in soup.select('a[href^="/ShowUserReviews"]')]
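If you'd rather not worry about slashes when building absolute URLs, the standard library's urljoin is one option. This is only a minimal sketch: the sample markup and its href are made up for illustration, and BASE_URL is just the base URL mentioned above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Base URL as mentioned in the answer; the name BASE_URL is just for this sketch
BASE_URL = "https://www.tripadvisor.com/"

# Shortened sample markup with a hypothetical href, for illustration only
sample_html = (
    '<div data-test-target="review-title" dir="ltr">'
    '<a dir="" href="/ShowUserReviews-example.html"><span><span>Wow</span></span></a>'
    "</div>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# urljoin resolves the root-relative href against the base, so no double slash appears
absolute_urls = [
    urljoin(BASE_URL, a["href"])
    for a in soup.select('a[href^="/ShowUserReviews"]')
]
print(absolute_urls)
```

Because the hrefs are root-relative, urljoin gives the same result whether or not the base URL ends with a slash.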
Example
html='''
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>ONE OF THE BEST !</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r832190054-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>excellent stay</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wow</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r830900803-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Really nice hotel</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d229324-r829471539-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great front desk staff</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Excellent walking tour by Victoria!</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r819463197-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>The best of the best in Lisbon</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r833862442-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great staffs, great hotel and great tours</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r833861014-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Wonderful Experience and Best Hotel in Lisbon</span></span></a></div>
<div  data-test-target="review-title" dir="ltr"><a  dir="" href="/ShowUserReviews-g189158-d12659702-r833717753-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html"><span><span>Great hotel.  Great Staff.  Wonderful walking tour with David.</span></span></a></div>

'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
urls = [a['href'] for a in soup.select('a[href^="/ShowUserReviews"]')]
Output urls
['/ShowUserReviews-g189158-d229324-r832749959-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d229324-r832190054-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d229324-r831182259-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d229324-r830900803-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d229324-r829471539-Sheraton_Lisboa_Hotel_Spa-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d12659702-r833957443-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d12659702-r819463197-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d12659702-r833862442-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d12659702-r833861014-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html',
 '/ShowUserReviews-g189158-d12659702-r833717753-Corpo_Santo_Lisbon_Historical_Hotel-Lisbon_Lisbon_District_Central_Portugal.html']
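Another possible variation, in case the page contains other /ShowUserReviews links you don't want: the markup in the question puts every review link inside a div with data-test-target="review-title", so you could select on that attribute instead. This is only a sketch, assuming that attribute stays stable on the real page; the sample markup and hrefs below are made up.

```python
from bs4 import BeautifulSoup

# Shortened sample markup with hypothetical hrefs, for illustration only
sample_html = """
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-example-1.html"><span><span>Wow</span></span></a></div>
<div data-test-target="review-title" dir="ltr"><a dir="" href="/ShowUserReviews-example-2.html"><span><span>Really nice hotel</span></span></a></div>
<a href="/Hotels-example.html">Unrelated link</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keep only <a> tags that are direct children of a review-title div,
# and pair each link with its title text
reviews = [
    (a["href"], a.get_text(strip=True))
    for a in soup.select('div[data-test-target="review-title"] > a[href]')
]

for href, title in reviews:
    print(href, title)
```

This also gives you the review title alongside each link, which may be handy if you plan to store the results.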