I am very new to web scraping. I am trying to scrape the following HTML:
<div class="p3">
<div>
<span class="poptip"><strong>BP</strong></span>
<p>110</p></div>
<div>
<span class="poptip"><strong>Weight</strong></span>
<p>55</p></div>
<div>
<span class="poptip"><strong>Age</strong></span>
<p>28</p></div>
<div>
<span class="poptip"><strong>Height</strong></span>
<p>155</p></div>
</div>
What I am trying to scrape is the 155 (which is the height).
I thought of collecting the text of all the <p> elements into a list and taking the last one. But when I try, I only get 110 as output (not a list of 110, 55, 28, 155). How can I get the text of all the <p> elements into a list?
This is my try:
p_list = []
data = soup.find_all('div', class_='p3')
for info in data:
    p_data = info.find('p').text
    p_list.append(p_data)
print(p_list)
Or, is there a way to get the text in the <p> tag if the text in the span just before the <p> tag is 'Height'?
As a beginner, I would highly appreciate your help.
CodePudding user response:
You can use class_=False as your attribute filter, which matches only tags that have no class attribute.
If you want to get just the height, then just grab the last element:
import bs4

# html holds the markup shown in the question
soup = bs4.BeautifulSoup(html, 'lxml')
height = soup.find('div', class_='p3').find_all('p', class_=False)[-1]
print(height.text)
>>> 155
If you want to select all of the elements, then you can build a mapping using zip
soup = bs4.BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='p3')
tags = div.find_all('span')
nums = div.find_all('p', class_=False)
attrs = {k.text: int(v.text) for k, v in zip(tags, nums)}
print(attrs)
>>> {'BP': 110, 'Weight': 55, 'Age': 28, 'Height': 155}
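For anyone who wants to run the mapping approach as-is, here is a minimal self-contained sketch. The html string is copied from the question; using html.parser instead of lxml is my own choice to avoid an extra dependency, and the variable names (container, labels, values) are only illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = """
<div class="p3">
<div><span class="poptip"><strong>BP</strong></span><p>110</p></div>
<div><span class="poptip"><strong>Weight</strong></span><p>55</p></div>
<div><span class="poptip"><strong>Age</strong></span><p>28</p></div>
<div><span class="poptip"><strong>Height</strong></span><p>155</p></div>
</div>
"""

# html.parser is an assumption here; it ships with Python.
soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="p3")

labels = container.find_all("span", class_="poptip")
values = container.find_all("p")

# Pair each label with the <p> in the same position.
attrs = {
    label.get_text(strip=True): int(value.get_text(strip=True))
    for label, value in zip(labels, values)
}

print(attrs["Height"])
```

Looking the value up by key ('Height') avoids relying on the height always being the last item, which seems safer if the page order ever changes.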
CodePudding user response:
You can get all the p.text into an array as follows:
from bs4 import BeautifulSoup
html = '''
<div class="p3">
 <div>
  <span class="poptip">
   <strong>
    BP
   </strong>
  </span>
  <p>
   110
  </p>
 </div>
 <div>
  <span class="poptip">
   <strong>
    Weight
   </strong>
  </span>
  <p>
   55
  </p>
 </div>
 <div>
  <span class="poptip">
   <strong>
    Age
   </strong>
  </span>
  <p>
   28
  </p>
 </div>
 <div>
  <span class="poptip">
   <strong>
    Height
   </strong>
  </span>
  <p>
   155
  </p>
 </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
#print(soup.prettify())
# 'span.poptip + p' selects each <p> that immediately follows a span.poptip sibling
p_tag = soup.select('span.poptip + p')
array = [p.get_text(strip=True) for p in p_tag]
print(array)
Output
['110', '55', '28', '155']
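As for why the attempt in the question gave only 110: as far as I can tell, it is because find() returns just the first matching tag, while find_all() returns every match. A minimal sketch of the difference, assuming the markup from the question (the names first_only and p_list are only illustrative):

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = """
<div class="p3">
<div><span class="poptip"><strong>BP</strong></span><p>110</p></div>
<div><span class="poptip"><strong>Weight</strong></span><p>55</p></div>
<div><span class="poptip"><strong>Age</strong></span><p>28</p></div>
<div><span class="poptip"><strong>Height</strong></span><p>155</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="p3")

# find() stops at the first <p> inside the container.
first_only = container.find("p").text

# find_all() returns every <p> inside the container, nested ones included.
p_list = [p.text for p in container.find_all("p")]

print(first_only)
print(p_list)
print(p_list[-1])
```

There is only one div with class "p3", so looping over it and calling find('p') once can never yield more than one value.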
CodePudding user response:
What I am trying to scrape is the 155.(which is the height)
Option#1
To get the text of the last <p> in the <div> with class "p3", you can use a CSS selector:
soup.select_one('div.p3 :last-child p').text
Option#2
As an alternative, you can create a list of the texts of all <p> elements and take the last one:
[x.text for x in soup.select('div.p3 p')][-1]
Option#3
Or, is there a way to get the text in <p> tag if the text in the prior span of the <p> tag is 'Height'?
Locate the <div> whose direct child <span> contains "Height" and select its direct <p> child:
soup.select_one('div:has(> span:-soup-contains("Height")) > p').text
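If you would rather avoid CSS selectors, the same lookup can be sketched with find() and find_next(). This is only one possible approach, assuming the markup from the question and an exact match on the label text; value_for is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = """
<div class="p3">
<div><span class="poptip"><strong>BP</strong></span><p>110</p></div>
<div><span class="poptip"><strong>Weight</strong></span><p>55</p></div>
<div><span class="poptip"><strong>Age</strong></span><p>28</p></div>
<div><span class="poptip"><strong>Height</strong></span><p>155</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def value_for(soup, label):
    """Hypothetical helper: text of the first <p> after the <strong>
    whose text is exactly `label`, or None if there is no such label."""
    strong = soup.find("strong", string=label)
    if strong is None:
        return None
    p = strong.find_next("p")
    return p.get_text(strip=True) if p is not None else None


height = value_for(soup, "Height")
print(height)
```

The exact-match lookup assumes the <strong> text has no surrounding whitespace, as in the question's markup; for pretty-printed HTML you would need to strip or use a regular expression instead.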