Python BeautifulSoup: scrape only part of a text

Time:08-15

I am trying to scrape part of a text in Python using BeautifulSoup. For example: https://www.weisshaus.de/details/monkey-47-dry-gin-47-vol.-0-50l

On this page the alcohol percentage is included in the product title, and I want to store only the alcohol percentage in the variable "alcoholpercentage".

I am able to scrape the product title using:

try:
    productnaam = getTextFromHTMLItem(soup.find('h1', {'class':'product--title'}))
except:
    productnaam = ""

Where the function getTextFromHTMLItem is as follows:

def getTextFromHTMLItem(HTMLItem):
    # soup.find() returns None when nothing matches, and None has no .text
    try:
        return HTMLItem.text
    except AttributeError:
        return " "

But how do I extract just the alcohol percentage from this title?

Thanks in advance for the help :)

CodePudding user response:

To extract the alcohol percentage, you can also use the split() method:

from bs4 import BeautifulSoup 
import requests
url = 'https://www.weisshaus.de/details/monkey-47-dry-gin-47-vol.-0-50l'
res = requests.get(url) 
soup = BeautifulSoup(res.text, 'lxml')

txt = soup.find('div', {'class':'product--title'}).h1

# split() breaks the title into words; index 4 is the "47%" token for this product
print(txt.text.split()[4])

Output:

47%
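
Note that the fixed index 4 only works when the percentage is the fifth word of the title. As a minimal sketch of a less position-dependent variant, the helper below (percent_token is a hypothetical name, not part of the original answer) returns the first word that ends in "%". It assumes the percentage is written without a space before the "%" sign.

```python
def percent_token(title):
    """Return the first whitespace-separated word ending in '%', or '' if none."""
    for token in title.split():
        if token.endswith('%'):
            return token
    return ''

# Title text taken from the example product page
print(percent_token('Monkey 47 Dry Gin 47% vol. 0,50l'))
```

With BeautifulSoup this could be called as percent_token(txt.text), using the txt element from the snippet above.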

CodePudding user response:

You could use a regex to find the alcohol percentage in the title in a generic way:

\d+(?:\.\d+)?%

Be aware that if the title contains more than one percentage, you have to find a more specific pattern or pick the match by index.

import re

s = 'Monkey 47 Dry Gin 47% vol. 0,50l'
re.findall(r'\d+(?:\.\d+)?%', s)  # ['47%']
Example

In this case there is only one percentage, so you could go with:

from bs4 import BeautifulSoup
import requests, re
url = 'https://www.weisshaus.de/details/monkey-47-dry-gin-47-vol.-0-50l'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# [-1] picks the last percentage found in the <h1> text
print(re.findall(r'\d+(?:\.\d+)?%', soup.h1.text)[-1])
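
If the value should end up as a number rather than a string like "47%", the regex idea can be extended a little. The following is only a sketch under assumptions: extract_alcohol_percentage and PERCENT_RE are hypothetical names, the pattern additionally accepts a decimal comma (as used on German shops, e.g. "37,5%") and optional whitespace before the "%", and the first percentage in the title is taken to be the alcohol content.

```python
import re

# Digits, optionally followed by '.' or ',' and more digits, then optional spaces and '%'
PERCENT_RE = re.compile(r'(\d+(?:[.,]\d+)?)\s*%')

def extract_alcohol_percentage(title):
    """Return the first percentage in the title as a float, or None if there is none."""
    match = PERCENT_RE.search(title)
    if match is None:
        return None
    # Normalise a decimal comma to a dot so float() can parse it
    return float(match.group(1).replace(',', '.'))

alcoholpercentage = extract_alcohol_percentage('Monkey 47 Dry Gin 47% vol. 0,50l')
print(alcoholpercentage)
```

Returning None instead of raising keeps the caller free to fall back to an empty value, similar to how the question's code falls back to "" for a missing title.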