I am trying to scrape part of a text in Python using BeautifulSoup. To give you an example: https://www.weisshaus.de/details/monkey-47-dry-gin-47-vol.-0-50l.
On this page the alcohol percentage is included in the product title, and I want to store only that percentage in the variable "alcoholpercentage".
I am able to scrape the product title using:
try:
    productnaam = getTextFromHTMLItem(soup.find('h1', {'class':'product--title'}))
except:
    productnaam = ""
Where the function getTextFromHTMLItem is as follows:
def getTextFromHTMLItem(HTMLItem):
    try:
        return HTMLItem.text
    except:
        return " "
But how do I extract just the alcohol percentage from this title?
Thanks in advance for the help :)
CodePudding user response:
To extract the alcohol percentage, you can also use the split() method and pick the token by its position in the title:
from bs4 import BeautifulSoup
import requests
url = 'https://www.weisshaus.de/details/monkey-47-dry-gin-47-vol.-0-50l'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
txt = soup.find('div', {'class':'product--title'}).h1
print(txt.text.split()[4])
Output:
47%
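Note that the index 4 only works for titles with exactly this word order. A minimal sketch of a less position-dependent variant, assuming the percentage is the first whitespace-separated token that ends in `%`; the helper name `first_percent_token` is hypothetical and not part of the code above:

```python
def first_percent_token(title):
    """Return the first whitespace-separated token ending in '%', or '' if none."""
    for token in title.split():
        if token.endswith('%'):
            return token
    return ''

# The title string is the one from the example page.
print(first_percent_token('Monkey 47 Dry Gin 47% vol. 0,50l'))
```

This still relies on the percentage being written without a space before the `%` sign.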
CodePudding user response:
You could use a regex to find the alcohol percentage in the title in a generic way:
\d+(?:\.\d+)?%
Be aware that if the title contains more than one percentage, you will have to find a more specific pattern or pick the match by index.
import re
s = 'Monkey 47 Dry Gin 47% vol. 0,50l'
re.findall(r'\d+(?:\.\d+)?%', s)
Example
In this case there is only one percentage so you could go with:
from bs4 import BeautifulSoup
import requests, re
url = 'https://www.weisshaus.de/details/monkey-47-dry-gin-47-vol.-0-50l'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
print(re.findall(r'\d+(?:\.\d+)?%', soup.h1.text)[-1])
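If the title might contain no percentage at all, indexing the result of `findall` raises an `IndexError`. A minimal sketch of a wrapper that returns an empty string instead, in the spirit of the `getTextFromHTMLItem` helper from the question. The function name `extract_alcohol_percentage` is hypothetical, and accepting a comma as decimal separator is an assumption based on the German number format seen in `0,50l`:

```python
import re

# Assumption: digits, optionally followed by a '.' or ',' decimal part,
# then a '%' sign with no space in between.
PERCENT_RE = re.compile(r'\d+(?:[.,]\d+)?%')

def extract_alcohol_percentage(title):
    """Return the first percentage found in the title, or '' if there is none."""
    match = PERCENT_RE.search(title)
    return match.group(0) if match else ""

# Title string taken from the example page.
alcoholpercentage = extract_alcohol_percentage('Monkey 47 Dry Gin 47% vol. 0,50l')
print(alcoholpercentage)
```

In the scraper, the argument would be the scraped title, for example `productnaam` from the question.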