I am having an issue with a web scraper I am building for a website I'm developing. When I try to get a product's title, which is in an h1 element, it prints the whole tag instead of just the text:
<h1 >CHERRY MX SILENT RED(10pcs)</h1>
I just want the Cherry MX Silent Red text, not the surrounding tag.
Here is the code for my web scraper:
import requests
from bs4 import BeautifulSoup
URL = 'https://kbdfans.com/collections/cherry-switches/products/cherry-mx-silent-red'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class' : 'product-detail__title small-title'})
print(title)
CodePudding user response:
Try this:
title.get_text()
Your title is not a string; it is a bs4.element.Tag object, so print(title) shows the whole element, markup included.
From the BeautifulSoup documentation:
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
find() works the same way but returns only the first match, or None if nothing matches.
For your title variable, you can refer to the bs4.element.Tag documentation, and if in doubt you can always print all the attributes and methods available on that object like this:
print(dir(title))
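To make the Tag-versus-string difference concrete, here is a minimal sketch that runs without any network access. The inline HTML is an assumption standing in for the downloaded page; the class names are copied from the question.

```python
from bs4 import BeautifulSoup

# Assumed markup: a stand-in for the real product page.
html = '<h1 class="product-detail__title small-title">CHERRY MX SILENT RED(10pcs)</h1>'
soup = BeautifulSoup(html, 'html.parser')

title = soup.find('h1')          # a bs4.element.Tag, not a str
print(type(title).__name__)      # name of the object's class
print(title)                     # the whole element, markup included

title_text = title.get_text()    # a plain str with only the text content
print(title_text)
```

Printing the Tag shows the markup because its string form is the serialized element; get_text() returns only the text nodes inside it.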
CodePudding user response:
To get the text from your <h1>, just use .text. Use .get_text() when you need to pass custom arguments, for example to strip whitespace or add a separator (e.g. title.get_text(separator=',', strip=True)).
print(title.text)
or
print(title.get_text())
Example
import requests
from bs4 import BeautifulSoup
URL = 'https://kbdfans.com/collections/cherry-switches/products/cherry-mx-silent-red'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class' : 'product-detail__title small-title'})
print(title.text)
Output
CHERRY MX SILENT RED(10pcs)
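If the page layout changes, find() returns None and calling .text on it raises an AttributeError. A slightly more defensive version might look like the sketch below. The helper name extract_title and the inline HTML (including the nested span) are hypothetical, chosen only to show what separator and strip do.

```python
from bs4 import BeautifulSoup

# Assumed markup with extra whitespace and a nested tag, for illustration only.
html = '''
<div>
  <h1 class="product-detail__title small-title">
    CHERRY MX SILENT RED<span>(10pcs)</span>
  </h1>
</div>
'''

def extract_title(soup):
    """Return the product title text, or None if no matching <h1> exists."""
    # class_ matches any element that carries this class among its classes.
    title = soup.find('h1', class_='product-detail__title')
    if title is None:
        return None
    # strip=True trims each text fragment; separator joins the fragments.
    return title.get_text(separator=' ', strip=True)

found = extract_title(BeautifulSoup(html, 'html.parser'))
missing = extract_title(BeautifulSoup('<p>no heading here</p>', 'html.parser'))
print(found)
print(missing)
```

Matching on a single class with class_ is a design choice here: it keeps working if the site adds or reorders other classes on the same element, whereas matching the full class string depends on the exact attribute value.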