I'm trying to extract a text string from a <p> tag; the string I'm interested in is separated from the rest by <br> tags.
<div id="foo">
<p>
" Data 1 : Lorem"
<br>
<br>
" Data 2 : Ipsum"
<br>
</p>
</div>
Desired output :
Lorem
Using bs4, I'm stuck at:
collection1 = soup.select('div#foo > p:-soup-contains("Data 1 : ")').replace("Data 1 : ","").text.strip()
I don't know how to proceed to set a delimiter for the double quotes or the <br> tag. Any idea on how to get the desired output?
I'm trying to scrape the detailed information from this page. I've tried:
try:
    collection = soup.select('div#ui-accordion-1-panel-1 > div.tab-content-wrapper > p:-soup-contains("Collection")').text.strip()
except:
    collection = ""
    print("No Collection")
I expected to get the whole <p> tag, but an exception occurred. I've used this snippet for other scrapes with Selenium and it worked.
CodePudding user response:
Here is one way of getting that data:
from bs4 import BeautifulSoup as bs
html = '''
<div id="foo">
<p>
" Data 1 : Lorem"
<br>
<br>
" Data 2 : Ipsum"
<br>
</p>
</div>
'''
soup = bs(html, 'html.parser')
desired_data = soup.select_one('div[id="foo"] p').contents[0].split(':')[1].replace('"', '').strip()
print(desired_data)
Result:
Lorem
And here is one way (out of many others) to get the collection info from that page:
from bs4 import BeautifulSoup as bs
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
r = requests.get('https://www.messika.com/fr/bracelet-pm-diamant-or-rose-d-vibes-12350-pg', headers=headers)
soup = bs(r.text, 'html.parser')
# the class name below is assumed from the selector used in the question
info = [x for x in soup.select_one('div.tab-content-wrapper p:-soup-contains("Univers")').contents if 'Collection :' in x][0].split(':')[-1].strip()
print('Collection:', info)
Result:
Collection: D-Vibes
Relevant documentation: https://beautiful-soup-4.readthedocs.io/en/latest/
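A note on why the snippets in the question raise: as far as I can tell, select() returns a list-like ResultSet of matches rather than a single tag, so chaining .text or .replace() onto it fails, while select_one() returns the first matching Tag (or None when nothing matches). Below is a minimal sketch on the sample HTML from the question, assuming the quotes shown there are not actually part of the text. The helper name text_or_default is made up for illustration:

```python
from bs4 import BeautifulSoup

html = '''
<div id="foo">
<p>
 Data 1 : Lorem
<br>
<br>
 Data 2 : Ipsum
<br>
</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# select() returns a list-like ResultSet, even when only one tag matches
matches = soup.select('div#foo > p:-soup-contains("Data 1 : ")')

# select_one() returns the first matching Tag, or None if nothing matches
first = soup.select_one('div#foo > p:-soup-contains("Data 1 : ")')
missing = soup.select_one('div#foo > p:-soup-contains("Collection")')


def text_or_default(tag, default=''):
    """Hypothetical helper: stripped text of a tag, or a default if tag is None."""
    return tag.get_text(strip=True) if tag is not None else default


print(len(matches))
print(text_or_default(missing, 'No Collection'))
```

Checking for None explicitly avoids wrapping the lookup in a bare try/except, which would also hide unrelated errors.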
CodePudding user response:
There are not really any " characters in the string (they are only how the browser's inspector displays text nodes), and yes, you could use replace(), strip(), ... or build a dict that also provides all the other fields and lets you pick from them:
data = dict(f.split(' : ') for f in soup.select_one('.tab-content-wrapper > p').stripped_strings if ':' in f)
This will lead to a dict like this:
{'Référence': 'Bracelet D-Vibes petit modèle 12350-PG', 'Univers': 'Joaillerie', 'Collection': 'D-Vibes', 'Type de bijou': 'Bracelet diamant', 'Métal': 'Or rose', 'Pierres': 'Diamant', 'Poids total diamants': '0,45 carat, qualité G/VS', 'Longueur chaîne': '18 cm (5 anneaux de fermeture)', 'Catégorie': 'Bracelet femme'}
so you can simply pick your value by key:
data.get('Collection') if data.get('Collection') else 'No Collection'
That will give you:
D-Vibes
or, in case there is no Collection:
No Collection
Example
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get('https://www.messika.com/fr/bracelet-pm-diamant-or-rose-d-vibes-12350-pg').text, 'html.parser')
data = dict(f.split(' : ') for f in soup.select_one('.tab-content-wrapper > p').stripped_strings if ':' in f)
print(data.get('Collection') if data.get('Collection') else 'No Collection')
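The same idea can be tried offline on the sample HTML from the question. This is only a sketch: parse_pairs is a made-up helper name, and it swaps split(' : ') for str.partition(':') so that a value which itself contains a colon does not break the dict construction.

```python
from bs4 import BeautifulSoup

html = '''
<div id="foo">
<p>
 Data 1 : Lorem
<br>
<br>
 Data 2 : Ipsum
<br>
</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')


def parse_pairs(p_tag):
    """Hypothetical helper: turn 'key : value' text nodes of a tag into a dict."""
    pairs = {}
    for text in p_tag.stripped_strings:
        if ':' not in text:
            continue  # skip text nodes that are not key/value lines
        # split on the first colon only; the rest stays in the value
        key, _, value = text.partition(':')
        pairs[key.strip()] = value.strip()
    return pairs


data = parse_pairs(soup.select_one('div#foo > p'))
print(data)
print(data.get('Data 1', 'No Data 1'))
```

Passing the fallback as the second argument of get() covers the missing-key case in one call.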