BeautifulSoup data scraping: unable to fetch correct information from the page

I am trying to scrape data from https://www.canadapharmacy.com/

Below are a few of the pages that I need to scrape:

I need all the information from each page. I wrote the code below:

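import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

base_url = 'https://www.canadapharmacy.com'
data = []
# medicine_url is the list of product paths collected beforehand (not shown here)
for i in tqdm(range(len(medicine_url))):
    r = requests.get(base_url + medicine_url[i])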
    
    soup = BeautifulSoup(r.text,'lxml')
    # Scraping medicine Name
    try:
        main_name = (soup.find('h1',{"class":"mn"}).text.lstrip()).rstrip()
    except:
        main_name = None
    
    try:
        sec_name = (soup.find('div',{"class":"product-name"}).find('h3').text.lstrip()).rstrip()
    except:
        sec_name = None
    
    try:
        generic_name = (soup.find('div',{"class":"card product generic strength equal"}).find('div').find('h3').text.lstrip()).rstrip()
    except:
        generic_name = None
        
    # Description
    
    try:
        des1 = soup.find('div',{"class":"answer expanded"}).find_all('p')[1].text
    except:
        des1 = '' 
    
    try:
        des2 = soup.find('div',{"class":"answer expanded"}).find('ul').text
    except:
        des2 = ''
    
    try:
        des3 = soup.find('div',{"class":"answer expanded"}).find_all('p')[2].text
    except:
        des3 = ''
    
    desc = (des1 + des2 + des3).replace('\n', ' ')
    
    #Directions
    
    try:
        dir1 = soup.find('div',{"class":"answer expanded"}).find_all('h4')[1].text
    except:
        dir1 = ''
    
    try:
        dir2 = soup.find('div',{"class":"answer expanded"}).find_all('p')[5].text
    except:
        dir2 = ''
    
    try:
        dir3 = soup.find('div',{"class":"answer expanded"}).find_all('p')[6].text
    except:
        dir3 = ''

    try:
        dir4 = soup.find('div',{"class":"answer expanded"}).find_all('p')[7].text
    except:
        dir4 = ''
    
    directions = dir1 + dir2 + dir3 + dir4
    
    #Ingredients
    try:
        ing = soup.find('div',{"class":"answer expanded"}).find_all('p')[9].text
    except:
        ing = None
        
    #Cautions
    try:
        c1 = soup.find('div',{"class":"answer expanded"}).find_all('h4')[3].text
    except:
        c1 = None
    
    
    try:
        c2 = soup.find('div',{"class":"answer expanded"}).find_all('p')[11].text
    except:
        c2 = ''
    
    try:
        c3 = soup.find('div',{"class":"answer expanded"}).find_all('p')[12].text #//div[@class='answer expanded']//p[2]
    except:
        c3 = ''
    
    try:
        c4 = soup.find('div',{"class":"answer expanded"}).find_all('p')[13].text
    except:
        c4 = ''
    
    try:
        c5 = soup.find('div',{"class":"answer expanded"}).find_all('p')[14].text
    except:
        c5 = ''
    
    try:
        c6 = soup.find('div',{"class":"answer expanded"}).find_all('p')[15].text
    except:
        c6 = ''
    
    # c1 may be None when the heading is missing, so fall back to an empty string
    caution = ((c1 or '') + c2 + c3 + c4 + c5 + c6).replace('\xa0', '')
    
    #Side Effects
    try:
        se1 = soup.find('div',{"class":"answer expanded"}).find_all('h4')[4].text
    except:
        se1 = ''
    
    try:
        se2 = soup.find('div',{"class":"answer expanded"}).find_all('p')[18].text
    except:
        se2 = ''

    try:
        se3 = soup.find('div',{"class":"answer expanded"}).find_all('ul')[1].text
    except:
        se3 = ''

    try:
        se4 = soup.find('div',{"class":"answer expanded"}).find_all('p')[19].text
    except:
        se4 = ''
    
    try:
        se5 = soup.find('div',{"class":"post-author-bio"}).text
    except:
        se5 = ''

    se = (se1 + se2 + se3 + se4 + se5).replace('\n', ' ')

    # Default to None so the dict below never refers to an undefined name
    prod_code = None
    for j in soup.select('div.answer.expanded h4'):
        if 'Product Code' in j.text:
            prod_code = j.text
        
    #prod_code = soup.find('div',{"class":"answer expanded"}).find_all('h4')[5].text #//div[@class='answer expanded']//h4
    
    pharma = {"primary_name":main_name,
          "secondary_name":sec_name,
          "Generic_Name":generic_name,
          "Description":desc,
          "Directions":directions,
          "Ingredients":ing,
          "Caution":caution,
          "Side_Effects":se,
          "Product_Code":prod_code}
    
    data.append(pharma)

But each page has the tags in different positions, so this does not return the correct data. So I tried:

soup.find('div',{"class":"answer expanded"}).find_all('h4')

which gives me the output:-

[<h4>Description </h4>,
 <h4>Directions</h4>,
 <h4>Ingredients</h4>,
 <h4>Cautions</h4>,
 <h4>Side Effects</h4>,
 <h4>Product Code : 5513 </h4>]

I want to create a data frame in which Description contains all the text given under the Description heading, Directions contains all the text given under the Directions heading on the web page, and so on. I tried:

for i in soup.find('div',{"class":"answer expanded"}).find_all('h4'):
    if 'Description' in i.text:
        print(soup.find('div',{"class":"answer expanded"}).findAllNext('p'))

but it prints all the <p> tags that come after soup.find('div',{"class":"answer expanded"}).find_all('h4'). I want only the <p> tags that give the description of the medicine, and no others.

Can anyone suggest how to do this? Also, how can I scrape the rate table from the page? It currently gives me the values in an unusable format.

CodePudding user response:

You can try the following working example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []
r = requests.get('https://www.canadapharmacy.com/products/abilify-tablet')

soup = BeautifulSoup(r.text,"lxml")
# Defaults in case a section heading is missing from the page
des = drc = ingre = cau = se = None
try:
    # Flatten the whole expanded answer block into one string
    card = ''.join(x.get_text(' ', strip=True) for x in soup.select('div.answer.expanded'))

    # Split on each heading in turn; this relies on the headings appearing in this order
    des = card.split('Directions')[0].replace('Description', '')
    rest = card.split('Directions')[1]
    drc = rest.split('Ingredients')[0]
    rest = rest.split('Ingredients')[1]
    ingre = rest.split('Cautions')[0]
    rest = rest.split('Cautions')[1]
    cau = rest.split('Side Effects')[0]
    se = rest.split('Side Effects')[1]
except IndexError:
    # A heading was not found, so the remaining fields keep their None defaults
    pass

data.append({
    'Description':des,
    'Directions':drc,
    'Ingredients':ingre,
    'Cautions':cau,
    'Side Effects':se
})

print(data)
# df = pd.DataFrame(data)
# print(df)

Output:

[{'Description': " Abilify Tablet (Aripiprazole) Abilify (Aripiprazole) is a medication prescribed to treat or manage different conditions, including: Agitation associated with schizophrenia or bipolar mania (injection formulation only) Irritability associated with autistic disorder Major depressive disorder , adjunctive treatment Mania and mixed episodes associated with Bipolar I disorder Tourette's disorder Schizophrenia Abilify works by activating different neurotransmitter receptors located in brain cells. Abilify activates D2 (dopamine) and 5-HT1A (serotonin) receptors and blocks 5-HT2A (serotonin) receptors. This combination of receptor activity is responsible for the treatment effects of Abilify. Conditions like schizophrenia, major depressive disorder, and bipolar disorder are caused by neurotransmitter imbalances in the brain. Abilify helps to correct these imbalances and return the normal functioning of neurons. ", 'Directions': ' Once you are prescribed and buy Abilify, then take Abilify exactly as prescribed by your doctor. The dose will vary based on the condition that you are treating. The starting dose of Abilify ranges from 2-15 mg once daily, and the recommended dose for most conditions is between 5-15 mg once daily. The maximum dose is 30 mg once daily. Take Abilify with or without food. ', 'Ingredients': ' The active ingredient in Abilify medication is aripiprazole . ', 'Cautions': ' Abilify and other antipsychotic medications have been associated with an increased risk of death in elderly patients with dementia-related psychosis. When combined with other dopaminergic agents, Abilify can increase the risk of neuroleptic malignant syndrome. Abilify can cause metabolic changes and in some cases can induce high blood sugar in people with and without diabetes . Abilify can also weight gain and increased risk of dyslipidemia. Blood glucose should be monitored while taking Abilify. Monitor for low blood pressure and heart rate while taking Abilify; it can cause orthostatic hypertension which may lead to dizziness or fainting. Use with caution in patients with a history of seizures. ', 'Side Effects': ' The side effects of Abilify vary greatly depending on what condition is being treated, what other medications are being used concurrently, and what dose is being taken. Speak with your doctor or pharmacist for a full list of side effects that apply to you. Some of the most common side effects include: Akathisia Blurred vision Constipation Dizziness Drooling Extrapyramidal disorder Fatigue Headache Insomnia Nausea Restlessness Sedation Somnolence Tremor Vomiting Buy Abilify online from Canada Pharmacy . Abilify can be purchased online with a valid prescription from a doctor. About Dr. Conor Sheehy (Page Author) Dr. Sheehy (BSc Molecular Biology, PharmD) works a clinical pharmacist specializing in cardiology, oncology, and ambulatory care. He’s a board-certified pharmacotherapy specialist (BCPS), and his experience working one-on-one with patients to fine tune their medication and therapy plans for optimal results makes him a valuable subject matter expert for our pharmacy. Read More.... IMPORTANT NOTE: The above information is intended to increase awareness of health information and does not suggest treatment or diagnosis. This information is not a substitute for individual medical attention and should not be construed to indicate that use of the drug is safe, appropriate, or effective for you. See your health care professional for medical advice and treatment. Product Code : 5513'}]
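
As an alternative to splitting the flattened text on heading words, the sections can also be collected from the tag structure itself: for each h4, gather the sibling elements that follow it until the next h4. The sketch below assumes the h4 headings and their p/ul content are direct siblings inside div.answer.expanded, which has not been verified against the live page. SAMPLE_HTML, sections_by_heading and table_rows are hypothetical names made up for illustration, and the table markup is invented, since the real rate table's structure is not shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the layout described in the question;
# the real page may be structured differently.
SAMPLE_HTML = """
<div class="answer expanded">
  <h4>Description </h4>
  <p>First description paragraph.</p>
  <ul><li>Point A</li><li>Point B</li></ul>
  <h4>Directions</h4>
  <p>Take as prescribed.</p>
  <h4>Product Code : 5513 </h4>
</div>
<table>
  <tr><th>Strength</th><th>Quantity</th><th>Price</th></tr>
  <tr><td>2mg</td><td>30</td><td>$10.00</td></tr>
</table>
"""


def sections_by_heading(soup):
    """Map each <h4> heading to the text that follows it, up to the next <h4>."""
    container = soup.select_one("div.answer.expanded")
    sections = {}
    if container is None:
        return sections
    for heading in container.find_all("h4"):
        parts = []
        # Walk the following sibling tags and stop at the next heading
        for sibling in heading.find_next_siblings():
            if sibling.name == "h4":
                break
            parts.append(sibling.get_text(" ", strip=True))
        sections[heading.get_text(strip=True)] = " ".join(parts)
    return sections


def table_rows(table):
    """Return a table as a list of rows, each row a list of cell strings."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [c.get_text(" ", strip=True) for c in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sections = sections_by_heading(soup)
rows = table_rows(soup.find("table"))
print(sections)
print(rows)
```

Because the grouping follows the headings rather than fixed positions such as find_all('p')[5], it should tolerate pages where the number of paragraphs per section differs. The resulting dict can be appended to data in the same way as above, and rows[0] can serve as the column names if the rows are loaded into a DataFrame.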