Trying to scrape a website but I don't get HTML content

Time:07-28

I'm trying to scrape this website, but the HTML I get does not match what I see in the browser's "Inspect Element" view. It feels as if the HTML content is hidden somehow:

from bs4 import BeautifulSoup 
import requests

result = requests.get("https://groceries.asda.com/aisle/price-match/view-all-price-match/view-all-price-match/1215686354045-1215686354052-1215686354053")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print(soup)

This is what I see in Inspect Element, and it is what I want to get:

[screenshot not available]

But when I print the soup I get something else (please run the code yourself, because the output is too long to paste here).

CodePudding user response:

I would suggest using Selenium to render the page in headless mode and then reading the page source, like this:

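from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without opening a visible window.
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

start_url = "https://duckgo.com"  # placeholder; replace with the page you want to scrape
try:
    driver.get(start_url)
    # page_source is the HTML of the page as the browser currently has it.
    print(driver.page_source.encode("utf-8"))
finally:
    driver.quit()  # always close the browser, even if loading fails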

Replace the URL with that of the page you want to scrape, then parse page_source with bs4.
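Before reaching for a browser, it can help to confirm that rendering really is what is missing. The sketch below checks whether a piece of text you can see in the browser appears in the visible text of a given HTML string. It uses only the standard library, and the names VisibleText, visible_text and appears_in_static_html are made up for this illustration. If the text is absent from the raw response but present in driver.page_source, the content is most likely being added by JavaScript.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes that are not inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text of an HTML string, joined by spaces."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def appears_in_static_html(html, needle):
    """Case-insensitive check for `needle` in the visible text of `html`."""
    return needle.lower() in visible_text(html).lower()
```

For example, appears_in_static_html(requests.get(url).text, "some product name") can be compared with the same call on driver.page_source.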

CodePudding user response:

The webpage is loaded dynamically via JavaScript, so the HTML you see in the browser is not in the response that requests downloads, and bs4 cannot see it. If your final aim is to scrape the data, you can get it from the site's API instead, which is both the most robust and the easiest approach, and it needs only the requests module.

Example:

import requests

api_url = "https://groceries.asda.com/api/bff/graphql"
payload= {"requestorigin":"gi","contract":"web/cms/get-items","variables":{"user_segments":["1259","1194","1140","1141","1182","1130","1128","1124","1126","1119","1123","1117","1112","1116","1109","1111","1102","1110","1097","1105","1100","1107","1098","1038","1087","1099","1070","1082","1067","1047","1059","1057","1055","1053","1043","1041","1042","1027","1023","1024","1020","1019","1007","1242","1241","1262","1239","1256","1245","1237","1263","1264","1233","1249","1260","1247","1238","1236","1227","1208","1220","1210","1172","1178","1222","1231","1217","1179","1225","1207","1167","1221","1219","1160","1180","1152","1213","1206","1176","1224","1165","1159","1209","1169","1144","1214","1177","1216","1196","1173","1186","1147","1183","1204","1174","1191","1201","1202","1190","1157","1198","1189","1166","1197","1150","1170","1184","1271","1278","1279","1269","1283","1284","1285","rmp_enabled_user","dp-False","wapp","store_4565","vp_M","anonymous","clothing_store_enabled","checkoutOptimization","NAV_UI","T003","T014"],"store_id":"4565","page":2,"page_size":60,"request_origin":"gi","type":"content","ship_date":1658880000000,"payload":{"cacheable":True,"hierarchy_id":"1215686354045-1215686354052-1215686354053","filter_query":[]}}}
headers={
    'content-type': 'application/json',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'request-origin': 'gi'
}
data = requests.post(api_url,headers=headers,json=payload).json()

for item in data['data']['tempo_items']['products']['items']:
    print(item['item']['name'])

Output:

Fixodent Complete Denture Adhesive Original
Surf Tropical Lily Concentrated Liquid Laundry Detergent 24 Washes
Always Maxi Profresh Night Sanitary Towels Without Wings
Pantene 3 Minute Miracle Repair&Protect Hair Conditioner
Garnier Ultimate Blends Coconut Oil Frizzy Hair Shampoo
Pedigree Schmackos Strips Adult Dog Treats Fish Mix
TRESemme Replenish & Cleanse Conditioner
Herbal Essences Hello Hydration Shampoo For Dry Hair
Blistex Relief Cream
Garnier Skin Active Micellar Cleansing Water Sensitive Skin       
TRESemme Rich Moisture Conditioner
Lemsip Max Day & Night Cold & Flu Relief Capsules
Lenor In-Wash Scent Booster Spring Awakening
Sudafed Congestion Headache Relief Day & Night Capsules
Halls Mentholyptus Extra Strong Lozenges 10 pack
Panadol Advance Paracetamol Tablets x16
Always Dailies Extra Protect Large Panty Liners
Simple Kind To Skin Purifying Cleansing Lotion
Nivea Gentle Exfoliating Face Scrub
Simple Kind to Skin Refreshing Facial Wash Gel
Pantene 3 Minute Miracle Smooth&Sleek Hair Conditioner
Olbas Oil Inhalant Decongestant
Johnson's Bedtime Shampoo
Huggies DryNites Pyjama Pants Girl 8-15 Years
Garnier Belle Color 6 Natural Light Brown Permanent Hair Dye
Westlab Pure Mineral Bathing Epsom Salt
Herbal Essences Ignite My Colour Hair Conditioner For Coloured Hair
Poligrip Denture Adhesive Ultra Fixative Cream
Garnier Ultimate Blends Argan Oil & Almond Cream Dry Hair Conditioner
Halls Original Sugar Free Lozenges 10 pack
Huggies DryNites Pyjama Pants Boy 8-15 Years
Westlab Sleep Epsom & Dead Sea Salts with Lavender & Jasmine
Herbal Essences Ignite My Colour Shampoo For Coloured Hair
Westlab Mindful Epsom & Himalayan Salts with Frankincense & Bergamot
Jolen Creme Bleach
Garnier Belle Color 7.1 Natural Dark Ash Blonde Permanent Hair Dye
Herbal Essences Dazzling Shine Hair Conditioner For All Hair Type
Dettol Antibacterial Disinfectant Multi Surface Spray Lemon & Lime
Lemsip Cold & Flu Lemon Flavour Sachets
Toplife Puppy Formula Milk
Westlab Pure Mineral Bathing Dead Sea Salt
Misfits Nasher Sticks Adult Medium Dog Treats with Chicken and Beef
Dove Deeply Nourishing Body Wash
Dreamies Cat Treat Biscuits with Chicken Mega Pack
Deep Freeze Cold Spray
Tena Lady Discreet Mini Pads
Pantene Pro-V Smooth & Sleek 3in1 Shampoo
Garnier Nutrisse 4.3 Dark Golden Brown Permanent Hair Dye
Fixodent Plus Dual Power Denture Adhesive
Beechams All In One Oral Solution 8 Doses 160ML
Panadol Extra Advance 500mg/65mg Tablets x14
Duck Fresh Brush Toilet Cleaning System Holder
Oral-B Allrounder Black Manual Toothbrush x 3
Dove Indulging Cream Bath Soak
Garnier Ultimate Blends Honey Treasures Strengthening Conditioner
Sudafed Sinus Max Strength Capsules
Johnson's Baby Shampoo
Halls Soothers Cherry Lozenges
Rennie Spearmint Heartburn & Indigestion Relief Tablets
Huggies DryNites Pyjama Pants Boy 4-7 Years
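The request above asks for a single page ("page": 2 inside "variables") and indexes straight into the response, which raises KeyError if the response shape differs. The following is a minimal sketch of a more defensive version that can also walk several pages. It assumes the API honours the "page" field for other page numbers and returns an empty item list past the last page; neither is confirmed here. The helper names dig, product_names, with_page and fetch_names are hypothetical.

```python
import copy


def dig(obj, *keys, default=None):
    """Follow nested dict keys, returning `default` if any step is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


def product_names(data):
    """Extract product names from a response shaped like the one above."""
    items = dig(data, "data", "tempo_items", "products", "items", default=[])
    names = []
    for entry in items or []:
        name = dig(entry, "item", "name")
        if name:
            names.append(name)
    return names


def with_page(payload, page):
    """Return a copy of the payload with variables.page set to `page`."""
    new_payload = copy.deepcopy(payload)
    new_payload["variables"]["page"] = page
    return new_payload


def fetch_names(api_url, payload, headers, pages):
    """POST one request per page and collect names until a page is empty."""
    import requests  # third-party; imported lazily so the helpers above work without it

    names = []
    for page in pages:
        response = requests.post(
            api_url, headers=headers, json=with_page(payload, page), timeout=30
        )
        response.raise_for_status()
        page_names = product_names(response.json())
        if not page_names:
            break  # assumed to mean there are no more pages
        names.extend(page_names)
    return names
```

With api_url, payload and headers defined as in the example, fetch_names(api_url, payload, headers, range(1, 6)) would try pages 1 to 5.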

Selenium with bs4:

The API returns JSON rather than HTML, so you can't get the HTML content through it. The webpage is dynamic and bs4 can't execute JavaScript, so to get the rendered HTML you can use Selenium together with bs4. The following code retrieves the rendered HTML of the product list from the page.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_experimental_option("detach", True)
# chrome_options.add_argument("--headless")

webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url='https://groceries.asda.com/aisle/price-match/view-all-price-match/view-all-price-match/1215686354045-1215686354052-1215686354053'
driver.get(url)
driver.maximize_window()
time.sleep(5)
#accept cookie
driver.find_element(By.XPATH,'//*[@id="onetrust-button-group-parent"]/div/button[1]').click()
time.sleep(2)
soup=BeautifulSoup(driver.page_source,'lxml')
html=soup.select_one('div.co-product-list > ul:nth-child(1)')
print(html.prettify())

Output:

<li >
       <div >
        <button aria-label="show information on Smooth &amp; Frizz Free"  data-auto-id="btnPromo" type="button">
         <picture >
          <source srcset="https://ui.assets-asda.com/dm/_103_frizzfree?$icon-wapp$=&amp;$Icon-wapp$=">
           <img alt="Smooth &amp; Frizz Free"  data-auto-id="" loading="lazy" src="https://ui.assets-asda.com/dm/_103_frizzfree?$icon-wapp$=&amp;$Icon-wapp$=" title="Smooth &amp; Frizz Free"/>
          </source>
         </picture>
        </button>
       </div>
      </li>
     </ul>
    </div>
   </div>
   <div >
    <div >
     <span >
      <strong >
       <span >
        now
       </span>
       £1.99
      </strong>
      <p >
       <span >
        (55.3p/100ml)
       </span>
      </p>
     </span>
    </div>
    <div >
     <div >
      <span  data-auto-id="">
       OUT OF STOCK
      </span>
      <button aria-disabled="false"  data-auto-id="linkSeeAlternatives" type="button">
       See alternatives
      </button>
     </div>
    </div>
   </div>
  </div>
 </li>

... and so on
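The fixed time.sleep(5) and time.sleep(2) calls in the script either waste time or turn out too short on a slow connection. A common alternative is to poll until the element you need exists. The helper below is a generic sketch of that idea using only the standard library; wait_until is a hypothetical name, and Selenium ships its own WebDriverWait class in selenium.webdriver.support.ui for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.25):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Possible use with the script above (driver and By come from that script):
#
#   wait_until(lambda: driver.find_elements(By.CSS_SELECTOR, "div.co-product-list > ul"))
#
# find_elements returns an empty list while nothing matches, so the lambda
# stays falsy until the product list has been rendered.
```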

CodePudding user response:

[screenshot not available]

If you open the website in incognito mode, you get the page shown in the screenshot. You have to click the box before you get all the HTML. I'm not sure what a good solution would be, but you could look into Selenium.

CodePudding user response:

The website most likely decides what to serve based on cookies and other request headers. This means that the HTML is generated dynamically and can vary from client to client.

For example, try setting the "User-Agent" header to a common browser value.

from bs4 import BeautifulSoup
import requests

# Present the request as coming from a regular desktop browser.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0) Gecko/20100101 Firefox/102.0'}
result = requests.get(
    url="https://groceries.asda.com/aisle/price-match/view-all-price-match/view-all-price-match/1215686354045-1215686354052-1215686354053",
    headers=headers
)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print(soup)

You will then see that the content is different. If you want to get exactly the same result as your browser, try copying all the headers from your browser's request into your code, although even then a match is not guaranteed.
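Copying headers by hand is error-prone, so a small helper can turn a header block pasted from the browser's network tools into a dict. This is only a sketch: parse_header_block is a hypothetical name, and it assumes the block has one "Name: value" pair per line. It skips HTTP/2 pseudo-headers such as ":authority", as well as Host and Content-Length, which requests works out for itself.

```python
# Headers that requests computes on its own and that should not be copied.
SKIPPED = {"host", "content-length"}


def parse_header_block(raw):
    """Convert a pasted block of 'Name: value' lines into a headers dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Ignore blank lines and HTTP/2 pseudo-headers (they start with ':').
        if not line or line.startswith(":"):
            continue
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() in SKIPPED:
            continue
        headers[name.strip()] = value.strip()
    return headers
```

The result can be passed directly as requests.get(url, headers=parse_header_block(raw)).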

Also, as the other answers mention, you should use Selenium to execute the JavaScript that the website returns.
