How to scrape text between <br> tags with BeautifulSoup?

Time:11-19

I'm trying to extract a text string from a <p> tag. The strings I'm interested in are separated by <br> tags.

<div id="foo">
 <p>
  " Data 1 : Lorem"
  <br>
  <br>
  " Data 2 : Ipsum"
  <br>
 </p>
</div>

Desired output :

Lorem

Using bs4, I'm stuck at :

collection1 = soup.select('div#foo > p:-soup-contains("Data 1 : ")').replace("Data 1 : ","").text.strip()

I don't know how to set a delimiter for the double quotes or the <br> tag. Any idea how to proceed to get the desired output?

I'm trying to scrape the detail information from this page. I've tried:

try:
   collection = soup.select('div#ui-accordion-1-panel-1 > div.tab-content-wrapper > p:-soup-contains("Collection")').text.strip()
except:
   collection = "" 
   print("No Collection")              

I expected to get the whole <p> tag, but an exception occurred. I've been using this snippet on other scrapes with Selenium and it worked there.

CodePudding user response:

Here is one way of getting that data:

from bs4 import BeautifulSoup as bs

html = '''
<div id="foo">
 <p>
  " Data 1 : Lorem"
  <br>
  <br>
  " Data 2 : Ipsum"
  <br>
 </p>
</div>
'''

soup = bs(html, 'html.parser')
desired_data = soup.select_one('div[id="foo"] p').contents[0].split(':')[1].replace('"', '').strip()
print(desired_data)

Result:

Lorem
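
As for why the original attempt raises an exception: select() returns a list-like ResultSet rather than a single tag, so .replace() and .text are not available on it. Below is a minimal sketch, assuming the sample HTML from the question (with the final tag closed as </div>), that uses select_one() and stripped_strings to collect every value after the colon, not just the first one:

```python
from bs4 import BeautifulSoup

html = '''
<div id="foo">
 <p>
  " Data 1 : Lorem"
  <br>
  <br>
  " Data 2 : Ipsum"
  <br>
 </p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# select() returns a list-like ResultSet, so string methods cannot be called on it
matches = soup.select('div#foo > p')
print(type(matches).__name__, len(matches))

# select_one() returns a single Tag, or None when nothing matches
p = soup.select_one('div#foo > p')

# stripped_strings yields each non-empty text node between the <br> tags,
# already trimmed of surrounding whitespace
values = [
    s.split(':', 1)[1].replace('"', '').strip()
    for s in p.stripped_strings
    if ':' in s
]
print(values)
```

Here values[0] is the string the question asks for, and the remaining entries hold the other fields in document order.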

And here is one way (out of many others) to get the collection info from that page:

from bs4 import BeautifulSoup as bs
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

r = requests.get('https://www.messika.com/fr/bracelet-pm-diamant-or-rose-d-vibes-12350-pg', headers=headers)
soup = bs(r.text, 'html.parser')
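# Assumption: the details paragraph sits inside div.tab-content-wrapper, as in the question's selector
info = [x for x in soup.select_one('div.tab-content-wrapper p:-soup-contains("Univers")').contents if 'Collection :' in x][0].split(':')[-1].strip()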
print('Collection:', info)

Result:

Collection: D-Vibes
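
The same lookup can be tried offline and without a bare except. The sketch below runs against a hypothetical HTML fragment that stands in for the live page (its structure and the div.tab-content-wrapper class are assumptions based on the selectors in this thread), and checks for None instead of catching every exception:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the live page; the real markup may differ
html = '''
<div class="tab-content-wrapper">
 <p>Référence : Bracelet D-Vibes petit modèle 12350-PG<br>
 Univers : Joaillerie<br>
 Collection : D-Vibes<br>
 Métal : Or rose</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# select_one() returns None when nothing matches, so test for that explicitly
p = soup.select_one('div.tab-content-wrapper p:-soup-contains("Univers")')

collection = ''
if p is not None:
    for node in p.contents:
        # Text nodes are str subclasses; <br> elements are Tag objects and are skipped
        if isinstance(node, str) and 'Collection :' in node:
            collection = node.split(':', 1)[1].strip()
            break

print('Collection:', collection or 'No Collection')
```

Testing for None keeps genuine errors (a typo in the selector, a network failure) visible instead of silently turning them into "No Collection".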

Relevant documentation: https://beautiful-soup-4.readthedocs.io/en/latest/

CodePudding user response:

There are no actual " characters in the string. You could use replace(), strip(), and so on, or you could build a dict that also holds all the other fields and lets you pick the one you need:

data = dict(f.split(' : ') for f in soup.select_one('.tab-content-wrapper > p').stripped_strings if ':' in f)

will lead to a dict like this:

{'Référence': 'Bracelet D-Vibes petit modèle 12350-PG', 'Univers': 'Joaillerie', 'Collection': 'D-Vibes', 'Type de bijou': 'Bracelet diamant', 'Métal': 'Or rose', 'Pierres': 'Diamant', 'Poids total diamants': '0,45 carat, qualité G/VS', 'Longueur chaîne': '18 cm (5 anneaux de fermeture)', 'Catégorie': 'Bracelet femme'}

so you could simply pick your value by key:

data.get('Collection') or 'No Collection'

That will give you:

D-Vibes

or, if there is no Collection:

No Collection

Example

from bs4 import BeautifulSoup
import requests

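# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(requests.get('https://www.messika.com/fr/bracelet-pm-diamant-or-rose-d-vibes-12350-pg').text, 'html.parser')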

data = dict(f.split(' : ') for f in soup.select_one('.tab-content-wrapper > p').stripped_strings if ':' in f)

print(data.get('Collection') or 'No Collection')
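
One caveat with the dict approach: dict() needs exactly two items per entry, so a value that itself contains ' : ' would raise a ValueError. A slightly more defensive sketch splits on the first separator only. The HTML fragment, the 'Note' field, and the parse_fields helper below are all hypothetical, made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page markup may differ
html = '''
<div class="tab-content-wrapper">
 <p>Univers : Joaillerie<br>
 Collection : D-Vibes<br>
 Note : ratio 1 : 2<br>
 no separator here</p>
</div>
'''


def parse_fields(p):
    """Build a dict from 'key : value' text nodes, splitting on the first ' : ' only."""
    fields = {}
    for text in p.stripped_strings:
        key, sep, value = text.partition(' : ')
        if sep:  # skip strings that do not contain the separator
            fields[key.strip()] = value.strip()
    return fields


soup = BeautifulSoup(html, 'html.parser')
data = parse_fields(soup.select_one('.tab-content-wrapper > p'))

print(data.get('Collection') or 'No Collection')
print(data.get('Note'))
```

Using partition() keeps everything after the first separator as the value, and lines without a separator are simply ignored.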