How to get the value of a "hidden" href?

I'm writing a web scraper whose first step is to collect the URL of every page of results. The code below worked on another site, but here I am having a problem getting the next page link (href).

Here's the code:

from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

userName = 'brendanm1975' # just for testing

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

pages = []

with requests.Session() as session:
  page_number = 1
  url = "https://www.last.fm/user/" + userName + "/library/artists?page="
  while True:
      response = session.get(url, headers=headers)
      soup = BeautifulSoup(response.content, 'html.parser')
      pages.append(url)

      next_link = soup.find("li", class_="pagination-next")
      if next_link is None:
        break

      url = urljoin(url, next_link["href"])
      page_number += 1

As you can see, the site gives the href only as "?page=2", which does not let me fetch the page content (https://www.last.fm/user/brendanm1975/library/artists?page=2).

I've already inspected the variables, and I'm getting the values.

print(url) # output: https://www.last.fm/user/brendanm1975/library/artists?page=
next_link.find('a').get('href') # output: '?page=2'

Does anyone know how to get around this?

CodePudding user response:

What happens?

You call urljoin(url, next_link["href"]), but next_link has no href attribute because you are selecting the <li>, not the <a> inside it.
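
To see the difference in isolation, here is a minimal sketch run against a hand-written HTML fragment. The fragment is an assumption modeled on the markup described in the question, not a copy of the real last.fm page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the pagination markup from the question.
html = '<ul><li class="pagination-next"><a href="?page=2">Next</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

li = soup.find("li", class_="pagination-next")

# The <li> itself carries only a class attribute, so there is no href to read.
li_href = li.get("href")  # .get() returns None instead of raising
try:
    li["href"]
    raised = False
except KeyError:
    raised = True  # subscript access on a missing attribute raises KeyError

# The href lives on the <a> nested inside the <li>.
a_href = li.a["href"]

print(li_href, raised, a_href)
```

The KeyError on the <li> is the failure the original loop runs into; reading the attribute from the nested <a> avoids it.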

How to fix?

Option#1 - Just select the <a> in your urljoin():

url = urljoin(url, next_link.a["href"])

Option#2 - Select the <a> directly:

next_link = soup.select_one('li.pagination-next a')
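
With Option#2, next_link is already the <a> (or None when there is no next page), so the href can be read from it directly. A small sketch on hand-written fragments follows; the markup is hypothetical and the helper name next_href is made up for illustration.

```python
from bs4 import BeautifulSoup


def next_href(html):
    """Return the href of the 'next page' link, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: an <a> descendant of <li class="pagination-next">.
    link = soup.select_one("li.pagination-next a")
    return link["href"] if link is not None else None


# Hypothetical fragments: one with a next link, one without (last page).
middle_page = '<ul><li class="pagination-next"><a href="?page=2">Next</a></li></ul>'
last_page = '<ul><li class="pagination-previous"><a href="?page=17">Previous</a></li></ul>'

print(next_href(middle_page))
print(next_href(last_page))
```

In the loop, the check for None stays the same; only the later access changes from next_link.a["href"] to next_link["href"].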

Example

from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

userName = 'brendanm1975' # just for testing

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

pages = []

with requests.Session() as session:

    url = "https://www.last.fm/user/" + userName + "/library/artists?page=1"
    while True:
        response = session.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        pages.append(url)

        next_link = soup.find("li", class_="pagination-next")
        if next_link is None:
            break

        url = urljoin(url, next_link.a["href"])

Output

['https://www.last.fm/user/brendanm1975/library/artists?page=1',
 'https://www.last.fm/user/brendanm1975/library/artists?page=2',
 'https://www.last.fm/user/brendanm1975/library/artists?page=3',
 'https://www.last.fm/user/brendanm1975/library/artists?page=4',
 'https://www.last.fm/user/brendanm1975/library/artists?page=5',
 'https://www.last.fm/user/brendanm1975/library/artists?page=6',
 'https://www.last.fm/user/brendanm1975/library/artists?page=7',
 'https://www.last.fm/user/brendanm1975/library/artists?page=8',
 'https://www.last.fm/user/brendanm1975/library/artists?page=9',
 'https://www.last.fm/user/brendanm1975/library/artists?page=10',
 'https://www.last.fm/user/brendanm1975/library/artists?page=11',
 'https://www.last.fm/user/brendanm1975/library/artists?page=12',
 'https://www.last.fm/user/brendanm1975/library/artists?page=13',
 'https://www.last.fm/user/brendanm1975/library/artists?page=14',
 'https://www.last.fm/user/brendanm1975/library/artists?page=15',
 'https://www.last.fm/user/brendanm1975/library/artists?page=16',
 'https://www.last.fm/user/brendanm1975/library/artists?page=17',
 'https://www.last.fm/user/brendanm1975/library/artists?page=18',...]
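
As for the href looking "hidden": "?page=2" is a relative reference that contains only a query string, and urljoin resolves it against the current URL by keeping the path and replacing the query. A standard-library-only sketch of that step, using the URL from the question as the base:

```python
from urllib.parse import urljoin

base = "https://www.last.fm/user/brendanm1975/library/artists?page=1"

# A query-only reference keeps the base path and swaps the query string.
second = urljoin(base, "?page=2")
third = urljoin(second, "?page=3")

print(second)
print(third)
```

So once the href is read from the <a>, no manual string building is needed; each joined URL can serve as the base for the next one.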