Beautiful Soup returns None even when there is an element

Time:02-12

I'm trying to filter the product name list using the header tags, but it always returns None.

source : https://www.tendercuts.in/chicken

code :

import requests
from bs4 import BeautifulSoup

def ExtractData(url):
    response = requests.get(url=url).content
    soup = BeautifulSoup(response, 'lxml')
    header = soup.find("mat-card-header", {"class": "mat-card-header ng-tns-c9-188"})
    print(header)

ExtractData(url="https://www.tendercuts.in/chicken")

CodePudding user response:

Here's code that iterates over all the <mat-card-header> items, printing each one's class list and the text of its card-title. You can filter further on the child elements of each header to find particular products.

soup = BeautifulSoup(response, 'lxml')
headers = soup.find_all("mat-card-header")
for header in headers:
    print(header.get('class'), header.find('mat-card-title').text)

Output:

['mat-card-header', 'ng-tns-c9-3'] Chicken Curry Cut (Skin Off)
['mat-card-header', 'ng-tns-c9-3'] Chicken Curry Cut (Skin Off)
...
['mat-card-header', 'ng-tns-c9-19'] Chicken Wings
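The filtering on child elements mentioned above can be sketched against a small static snippet (the markup below is a simplified stand-in for the real page, so the exact structure is an assumption):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the rendered page markup (assumed structure).
html = """
<mat-card-header class="mat-card-header ng-tns-c9-3">
  <mat-card-title>Chicken Curry Cut (Skin Off)</mat-card-title>
</mat-card-header>
<mat-card-header class="mat-card-header ng-tns-c9-19">
  <mat-card-title>Chicken Wings</mat-card-title>
</mat-card-header>
"""

soup = BeautifulSoup(html, 'html.parser')

# Keep only the headers whose title mentions a particular product.
wings = [
    h for h in soup.find_all('mat-card-header')
    if 'Wings' in h.find('mat-card-title').text
]
print([h.find('mat-card-title').text for h in wings])  # ['Chicken Wings']
```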

CodePudding user response:

What happens?

You are trying to find your tags by classes that do not exist in your soup, either because they are generated dynamically or because of a typo.
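The effect can be reproduced on a minimal snippet: a find() keyed to a dynamically generated class suffix matches nothing, while a find() by tag name succeeds (the class names below are illustrative, taken from the output above):

```python
from bs4 import BeautifulSoup

# The class suffix in the rendered page ("-3") differs from the one
# hard-coded in the find() call ("-188"), so find() returns None.
html = '<mat-card-title class="ng-tns-c9-3">Chicken Wings</mat-card-title>'
soup = BeautifulSoup(html, 'html.parser')

by_class = soup.find('mat-card-title', {'class': 'ng-tns-c9-188'})  # wrong suffix
by_tag = soup.find('mat-card-title')                                # matches

print(by_class)     # None
print(by_tag.text)  # Chicken Wings
```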

How to fix?

Select your elements more specifically, by tag or id, and avoid classes, since these are often generated dynamically:

[t.text for t in soup.find_all('mat-card-title')]

To avoid duplicates, just call set() on the result:

set([t.text for t in soup.find_all('mat-card-title')])

Example

import requests
from bs4 import BeautifulSoup

URL = 'https://www.tendercuts.in/chicken'
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'lxml')

print(set([t.text for t in soup.find_all('mat-card-title')]))

Output

{'Chicken Biryani Cut - Skin On','Chicken Biryani Cut - Skinless','Chicken Boneless (Cubes)','Chicken Breast Boneless','Chicken Curry Cut (Skin Off)','Chicken Curry Cut (Skin On)','Chicken Drumsticks',     'Chicken Liver','Chicken Lollipop','Chicken Thigh & Leg (Boneless)','Chicken Whole Leg','Chicken Wings','Country Chicken','Minced Chicken','Premium Chicken-Strips (Boneless)','Premium Chicken-Supreme (Boneless)','Smoky Country Chicken (Turmeric)'}

EDIT

To get titles and prices, I would recommend iterating over the mat-cards like this:

import requests,re
from bs4 import BeautifulSoup

URL = 'https://www.tendercuts.in/chicken'
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'lxml')

data = []
for item in soup.select('mat-card:has(mat-card-title)'):
    data.append({
        'title':item.find('mat-card-title').text,
        'price':re.search(r'₹\d*',item.find('app-price-display').text).group()
    })

print([dict(t) for t in {tuple(d.items()) for d in data}])

Output

[{'title': 'Smoky Country Chicken (Turmeric)', 'price': '₹429'}, {'title': 'Chicken Biryani Cut - Skin On', 'price': '₹135'}, {'title': 'Chicken Thigh & Leg (Boneless)', 'price': '₹245'}, {'title': 'Chicken Lollipop', 'price': '₹89'}, {'title': 'Chicken Wings', 'price': '₹119'}, {'title': 'Chicken Whole Leg', 'price': '₹135'}, {'title': 'Chicken Biryani Cut - Skinless', 'price': '₹145'}, {'title': 'Premium Chicken-Supreme (Boneless)', 'price': '₹179'}, {'title': 'Chicken Drumsticks', 'price': '₹129'}, {'title': 'Chicken Liver', 'price': '₹29'}, {'title': 'Premium Chicken-Strips (Boneless)', 'price': '₹179'}, {'title': 'Chicken Curry Cut (Skin On)', 'price': '₹115'}, {'title': 'Country Chicken', 'price': '₹389'}, {'title': 'Chicken Boneless (Cubes)', 'price': '₹245'}, {'title': 'Minced Chicken', 'price': '₹250'}, {'title': 'Chicken Curry Cut (Skin Off)', 'price': '₹99'}, {'title': 'Chicken Breast Boneless', 'price': '₹119'}]
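The last line of the snippet deduplicates the list of dicts. Since dicts are unhashable and cannot go into a set directly, each one is first converted to a hashable tuple of its items, the set drops the duplicates, and dict() converts each tuple back. The idiom in isolation:

```python
# Dicts are unhashable, so convert each to a tuple of (key, value)
# pairs, deduplicate via a set, then convert back to dicts.
data = [
    {'title': 'Chicken Wings', 'price': '₹119'},
    {'title': 'Chicken Wings', 'price': '₹119'},  # duplicate card
    {'title': 'Chicken Liver', 'price': '₹29'},
]

unique = [dict(t) for t in {tuple(d.items()) for d in data}]
print(len(unique))  # 2
```

Note that the set makes the result order arbitrary; sort afterwards if a stable order matters.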

CodePudding user response:

This is the most common problem with web scraping: most websites use JavaScript to change or add to the content on the page after loading the initial page. Whatever the JavaScript is supposed to change or load isn't on the page after the initial request.

The same is true for your code. If you look at the actual HTML your code receives (not what a browser shows), you'll find that it has many placeholder fields that the Angular code fills in later.
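A quick way to check whether a page is rendered client-side is to test whether the tag you want to scrape appears in the raw response at all. A rough heuristic, sketched here against static strings (the marker string and example markup are assumptions, not taken from the real page):

```python
def looks_client_rendered(html, marker='mat-card-title'):
    """Rough heuristic: if the tag we want to scrape is absent from the
    raw (pre-JavaScript) response, the content is rendered client-side."""
    return marker not in html

# An Angular shell page typically ships little more than the app root:
shell = '<html><body><app-root></app-root></body></html>'
print(looks_client_rendered(shell))  # True

# A server-rendered page already contains the product markup:
rendered = '<mat-card-title>Chicken Wings</mat-card-title>'
print(looks_client_rendered(rendered))  # False
```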

You'll need to load your page using a package like selenium, which uses a browser driver to load the page, execute the JavaScript and make the result available to you. (it does a lot more, like allowing you to navigate the site by clicking it, filling out fields, etc.)

selenium is a complex library with many options, but you can get started with:

pip install selenium

And by downloading a browser driver such as GeckoDriver or ChromeDriver.

And then something like this will work:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service    # for Chrome
# from selenium.webdriver.firefox.service import Service # for Firefox

service = Service('/path/to/driver')
service.start()

driver = webdriver.Remote(service.service_url)

driver.get('https://www.tendercuts.in/chicken')

# do something with what driver loaded here

driver.quit()

You could just run driver.page_source through bs4, but since you now have Selenium anyway, you could also look into the ways Selenium lets you find and select elements, such as its built-in XPath support.
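Once Selenium has executed the JavaScript, driver.page_source can be handed to BeautifulSoup exactly like a requests response. A sketch, using a static string in place of the real page_source (the markup is an assumed simplification of the rendered page):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the JavaScript has run
# (assumed, simplified markup).
page_source = """
<mat-card><mat-card-title>Chicken Wings</mat-card-title></mat-card>
<mat-card><mat-card-title>Chicken Liver</mat-card-title></mat-card>
"""

soup = BeautifulSoup(page_source, 'html.parser')
titles = [t.text for t in soup.find_all('mat-card-title')]
print(titles)  # ['Chicken Wings', 'Chicken Liver']
```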
