How can I get every medicine link with BeautifulSoup?

I want to scrape the link to every medicine on the Medicines List page, where each letter of the alphabet has a "view more" button.

import requests
from bs4 import BeautifulSoup

AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'
BASEURL = 'https://www.klikdokter.com/obat'
headers = {'User-Agent': AGENT}

# Fetch the index page and parse the HTML it returns
response = requests.get(BASEURL, headers=headers)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Print each medicine link found on the index page, then request that page
for tag in soup.find_all('a', class_='topics-index--section__item-link'):
    href = tag.get('href')
    if href is not None:
        print(href)
        response = requests.get(href, headers=headers)
        response.raise_for_status()

With this code I already get some of the medicines, but I'm missing every medicine that only appears after I click a "view more" button. Can anyone guide me on how to get the links I'm missing?

CodePudding user response:

Looking at the browser's Network tab, I see that clicking on each letter of the alphabet triggers an API call. You should build a list of those per-letter pages, which contain all the medicine names, and then iterate over them and scrape each one:

import string

# One URL per letter a-z (the original list skipped 'w')
target_urls = [
    f"https://www.klikdokter.com/obat?alphabet={letter}&partial=1"
    for letter in string.ascii_lowercase
]
print(target_urls)

Which yields:

['https://www.klikdokter.com/obat?alphabet=a&partial=1', 'https://www.klikdokter.com/obat?alphabet=b&partial=1', 'https://www.klikdokter.com/obat?alphabet=c&partial=1', 'https://www.klikdokter.com/obat?alphabet=d&partial=1', 'https://www.klikdokter.com/obat?alphabet=e&partial=1', 'https://www.klikdokter.com/obat?alphabet=f&partial=1', 'https://www.klikdokter.com/obat?alphabet=g&partial=1', 'https://www.klikdokter.com/obat?alphabet=h&partial=1', 'https://www.klikdokter.com/obat?alphabet=i&partial=1', 'https://www.klikdokter.com/obat?alphabet=j&partial=1', 'https://www.klikdokter.com/obat?alphabet=k&partial=1', 'https://www.klikdokter.com/obat?alphabet=l&partial=1', 'https://www.klikdokter.com/obat?alphabet=m&partial=1', 'https://www.klikdokter.com/obat?alphabet=n&partial=1', 'https://www.klikdokter.com/obat?alphabet=o&partial=1', 'https://www.klikdokter.com/obat?alphabet=p&partial=1', 'https://www.klikdokter.com/obat?alphabet=q&partial=1', 'https://www.klikdokter.com/obat?alphabet=r&partial=1', 'https://www.klikdokter.com/obat?alphabet=s&partial=1', 'https://www.klikdokter.com/obat?alphabet=t&partial=1', 'https://www.klikdokter.com/obat?alphabet=u&partial=1', 'https://www.klikdokter.com/obat?alphabet=v&partial=1', 'https://www.klikdokter.com/obat?alphabet=w&partial=1', 'https://www.klikdokter.com/obat?alphabet=x&partial=1', 'https://www.klikdokter.com/obat?alphabet=y&partial=1', 'https://www.klikdokter.com/obat?alphabet=z&partial=1']

Then loop over all of the links above and extract the information you need (note: the HTML may be different in this view, so you might have to tweak the soup.find_all() call):

for url in target_urls:
    # Fetch the partial page for one letter
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    for tag in soup.find_all('a', class_='topics-index--section__item-link'):
        href = tag.get('href')
        if href is not None:
            print(href)
            # Fetch the medicine page itself; parse detail_response.text as needed
            detail_response = requests.get(href, headers=headers)
            detail_response.raise_for_status()
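
Since the HTML of the partial view may differ from the main page, it can help to move the link extraction into a small function and try it on a saved fragment before running the full loop. The following is only a sketch under assumptions: the helper name extract_links, the SITE_ROOT constant, the fallback to every <a> tag, and the sample HTML are all made up for illustration and are not taken from the real site. It also resolves relative hrefs with urljoin and drops duplicates, which may or may not be needed for this site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

SITE_ROOT = "https://www.klikdokter.com"  # assumed base for relative hrefs
LINK_CLASS = "topics-index--section__item-link"  # class used in the code above


def extract_links(html, base_url=SITE_ROOT, link_class=LINK_CLASS):
    """Return the unique absolute hrefs in an HTML fragment, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", class_=link_class)
    if not anchors:
        # Assumption: if the partial view uses different markup,
        # fall back to every <a> tag that has an href attribute.
        anchors = soup.find_all("a", href=True)

    seen = set()
    links = []
    for tag in anchors:
        href = tag.get("href")
        if not href:
            continue
        absolute = urljoin(base_url, href)  # leaves absolute URLs unchanged
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Made-up fragment for a quick offline check (not real site markup)
sample_html = """
<ul>
  <li><a class="topics-index--section__item-link" href="/obat/example-a">Example A</a></li>
  <li><a class="topics-index--section__item-link" href="https://www.klikdokter.com/obat/example-b">Example B</a></li>
  <li><a class="topics-index--section__item-link" href="/obat/example-a">Example A again</a></li>
</ul>
"""

sample_links = extract_links(sample_html)
print(sample_links)
```

Inside the loop over target_urls you could then call extract_links(response.text) and collect the results in one list. If the fallback branch returns unrelated links such as navigation entries, tighten the selector once you have inspected the real response.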