How to scrape text between tags with BeautifulSoup?

I'm trying to extract a text string from a tag. The string I'm interested in is separated from the rest of the content by a <br> tag.

<div id="foo">
 <p>
  " Data 1 : Lorem"
  <br>
  <br>
  " Data 2 : Ipsum"
  <br>
 </p>
</div>

Desired output :

Lorem

Using bs4, I'm stuck at :

collection1 = soup.select('div#foo > p:-soup-contains("Data 1 : ")').replace("Data 1 : ","").text.strip()

I don't know how to proceed: how do I use the double quotes or the <br> tag as a delimiter? Any idea how to get the desired output?

I'm trying to scrape the detailed information on this page. I've tried:

try:
   collection = soup.select('div#ui-accordion-1-panel-1 > div.tab-content-wrapper > p:-soup-contains("Collection")').text.strip()
except:
   collection = "" 
   print("No Collection")

I expected to get the whole <p> tag, but an exception occurred. I've used this snippet in other scrapers with Selenium and it worked there.

CodePudding user response：

Here is one way of getting that data:

from bs4 import BeautifulSoup as bs

html = '''
<div id="foo">
 <p>
  " Data 1 : Lorem"
  <br>
  <br>
  " Data 2 : Ipsum"
  <br>
 </p>
</div>
'''

soup = bs(html, 'html.parser')
desired_data = soup.select_one('div[id="foo"] p').contents[0].split(':')[1].replace('"', '').strip()
print(desired_data)

Result:

Lorem
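As a side note on why the attempts in the question raise an exception: select() returns a list-like ResultSet, so calling .text on it fails, whereas select_one() returns a single Tag (or None). The short sketch below illustrates this on a reduced copy of the sample markup; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Reduced copy of the sample markup from the question (quotes omitted)
html = '<div id="foo"><p>Data 1 : Lorem<br><br>Data 2 : Ipsum<br></p></div>'
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list-like ResultSet, even for a single match
matches = soup.select("div#foo > p")

try:
    matches.text  # a ResultSet is not a Tag, so this attribute lookup fails
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# select_one() returns the first matching Tag, or None when nothing matches
first = soup.select_one("div#foo > p")
first_line = next(first.stripped_strings)

print(error_name)
print(first_line)
```

Either index into the result of select() or switch to select_one() before reading .text or .contents.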

And here is one way (out of many others) to get the collection info from that page:

from bs4 import BeautifulSoup as bs
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

r = requests.get('https://www.messika.com/fr/bracelet-pm-diamant-or-rose-d-vibes-12350-pg', headers=headers)
soup = bs(r.text, 'html.parser')
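# The attribute selector inside div[] was lost in the original post; the class below is assumed from the question's selector
info = [x for x in soup.select_one('div[class="tab-content-wrapper"] p:-soup-contains("Univers")').contents if 'Collection :' in x][0].split(':')[-1].strip()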
print('Collection:', info)

Result:

Collection: D-Vibes
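If the goal is to avoid the bare try/except from the question, one option is to check for a missing element explicitly. The sketch below is only an illustration: get_field is a hypothetical helper, and the markup is an assumed, reduced version of the page built from the field names quoted in this thread, so the live page may differ.

```python
from bs4 import BeautifulSoup

# Assumed markup, reduced from the fields quoted in this thread; the live page may differ
html = '''
<div class="tab-content-wrapper">
 <p>
  Univers : Joaillerie<br>
  Collection : D-Vibes<br>
 </p>
</div>
'''


def get_field(soup, label, default=""):
    """Hypothetical helper: return the value that follows 'label :' in the details paragraph."""
    paragraph = soup.select_one("div.tab-content-wrapper > p")
    if paragraph is None:  # select_one() found no matching element
        return default
    for text in paragraph.stripped_strings:
        key, sep, value = text.partition(":")
        if sep and key.strip() == label:
            return value.strip()
    return default  # the paragraph exists but has no such label


soup = BeautifulSoup(html, "html.parser")
print(get_field(soup, "Collection", "No Collection"))
print(get_field(soup, "Pierres", "No Pierres"))
```

Returning a default instead of catching every exception keeps real errors (network failures, typos in the selector syntax) visible.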

Relevant documentation: https://beautiful-soup-4.readthedocs.io/en/latest/

CodePudding user response：

The " characters are not really part of the string; they are most likely just how the browser's developer tools display text nodes. You could use replace(), strip() and so on, or build a dict that holds every field and lets you pick the one you need:

data = dict(f.split(' : ') for f in soup.select_one('.tab-content-wrapper > p').stripped_strings if ':' in f)

This produces a dict like this:

{'Référence': 'Bracelet D-Vibes petit modèle 12350-PG', 'Univers': 'Joaillerie', 'Collection': 'D-Vibes', 'Type de bijou': 'Bracelet diamant', 'Métal': 'Or rose', 'Pierres': 'Diamant', 'Poids total diamants': '0,45 carat, qualité G/VS', 'Longueur chaîne': '18 cm (5 anneaux de fermeture)', 'Catégorie': 'Bracelet femme'}

so you could simply pick your value by key:

data.get('Collection') or 'No Collection'

That will give you:

D-Vibes

or, if there is no Collection:

No Collection

Example

from bs4 import BeautifulSoup
import requests

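# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(requests.get('https://www.messika.com/fr/bracelet-pm-diamant-or-rose-d-vibes-12350-pg').text, 'html.parser')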

data = dict(f.split(' : ') for f in soup.select_one('.tab-content-wrapper > p').stripped_strings if ':' in f)

print(data.get('Collection') or 'No Collection')
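The same dict idea can be tried offline against the sample markup from the question. This is a minimal sketch, not the only way to do it: it uses str.partition instead of split(' : '), on the assumption that some values might themselves contain a separator, in which case dict() would raise a ValueError on the three-element result of split.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with the closing </div> corrected
html = '''
<div id="foo">
 <p>
  " Data 1 : Lorem"
  <br>
  <br>
  " Data 2 : Ipsum"
  <br>
 </p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

data = {}
for text in soup.select_one("div#foo > p").stripped_strings:
    text = text.strip('" ')  # drop literal quotes and spaces, if present
    key, sep, value = text.partition(":")  # split on the first colon only
    if sep:  # skip strings that contain no colon at all
        data[key.strip()] = value.strip()

print(data)
print(data.get("Data 1", "No Data 1"))
```

Here data.get("Data 1", "No Data 1") returns the fallback only when the key is missing, which is slightly different from the `or` form, which also replaces empty values.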