How to scrape a specific <p> with no class?

Time:12-04

I am very new to web scraping. I am trying to scrape the following HTML:

<div class="p3">
<div>
<span class="poptip"><strong>BP</strong></span>
<p>110</p></div>
<div>
<span class="poptip"><strong>Weight</strong></span>
<p>55</p></div>
<div>
<span class="poptip"><strong>Age</strong></span>
<p>28</p></div>
<div>
<span class="poptip"><strong>Height</strong></span>
<p>155</p></div>
</div>

What I am trying to scrape is the 155 (which is the height).

I thought of collecting the text of all the <p> elements into a list and taking the last one. But when I try, I only get 110 as output (not even a list of 110, 55, 28, 155). How can I get the text of all the <p> elements into an array?

This is my attempt:

p_list = []
data = soup.find_all('div', class_='p3')
for info in data:
  p_data = info.find('p').text
  p_list.append(p_data)
  print(p_list)

Or is there a way to get the text of a <p> tag if the text in the <span> just before it is 'Height'?

As a beginner, I would highly appreciate your help.

CodePudding user response:

You can use class_=False as your attribute filter; it matches only tags that have no class attribute.

If you want to get just the height, then just grab the last element:

import bs4

# html holds the markup from the question
soup = bs4.BeautifulSoup(html, 'lxml')
height = soup.find('div', class_='p3').find_all('p', class_=False)[-1]
print(height.text)

>>> 155
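
To see what class_=False does in isolation, here is a minimal sketch. The markup in `sample` is made up for illustration and is not from the question, and I use 'html.parser' only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> with a class attribute, one without.
sample = '<div><p class="note">ignore me</p><p>155</p></div>'
soup = BeautifulSoup(sample, 'html.parser')

# class_=False keeps only tags that have no class attribute at all.
unclassed = [p.text for p in soup.find_all('p', class_=False)]
# class_=True keeps only tags that do have a class attribute.
classed = [p.text for p in soup.find_all('p', class_=True)]

print(unclassed)
print(classed)
```

In the question's markup none of the <p> tags carry a class, so the filter mainly guards against picking up other, classed <p> tags if the page contains any.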

If you want to select all of the elements, then you can build a mapping using zip:

soup = bs4.BeautifulSoup(html, 'lxml')
div = soup.find('div', class_='p3')

tags = div.find_all('span')
nums = div.find_all('p', class_=False)

attrs = {k.text: int(v.text) for k, v in zip(tags, nums)}
print(attrs)

>>> {'BP': 110, 'Weight': 55, 'Age': 28, 'Height': 155}
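
As for why the loop in the question printed only 110: as far as I can tell, find_all('div', class_='p3') matches just the single outer <div>, so the loop body runs once, and find('p') returns only the first <p> inside it. A rough sketch of the difference, assuming the question's markup is stored in a variable I am calling `html` (the name and the use of 'html.parser' are my choices, not the asker's):

```python
from bs4 import BeautifulSoup

# The markup from the question; the variable name `html` is an assumption.
html = '''
<div class="p3">
<div><span class="poptip"><strong>BP</strong></span><p>110</p></div>
<div><span class="poptip"><strong>Weight</strong></span><p>55</p></div>
<div><span class="poptip"><strong>Age</strong></span><p>28</p></div>
<div><span class="poptip"><strong>Height</strong></span><p>155</p></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Only the outer <div> has class "p3", so this list has one element.
outer = soup.find_all('div', class_='p3')
# find('p') stops at the first <p>, so each outer <div> yields one value.
first_only = [div.find('p').text for div in outer]

# Iterating over the <p> tags themselves collects every value.
p_list = [p.text for p in soup.find('div', class_='p3').find_all('p')]

print(first_only)
print(p_list)
print(p_list[-1])  # the last entry is the height
```

So the fix is to loop over the <p> tags rather than over the outer <div>.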

CodePudding user response:

You can get all the p.text into an array as follows:

from bs4 import BeautifulSoup
html = '''
<div class="p3">
 <div>
  <span class="poptip">
   <strong>
    BP
   </strong>
  </span>
  <p>
   110
  </p>
 </div>
 <div>
  <span class="poptip">
   <strong>
    Weight
   </strong>
  </span>
  <p>
   55
  </p>
 </div>
 <div>
  <span class="poptip">
   <strong>
    Age
   </strong>
  </span>
  <p>
   28
  </p>
 </div>
 <div>
  <span class="poptip">
   <strong>
    Height
   </strong>
  </span>
  <p>
   155
  </p>
 </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
#print(soup.prettify())
# "+" selects the <p> that immediately follows each span.poptip
p_tag = soup.select('span.poptip + p')
array = [p.get_text(strip=True) for p in p_tag]
print(array)

Output

['110', '55', '28', '155']

CodePudding user response:

What I am trying to scrape is the 155 (which is the height).

Option #1

To get the text of the last <p> in the <div> with class "p3", you can use CSS selectors:

soup.select_one('div.p3 :last-child p').text

Option #2

As an alternative, you can create a list of the texts of all <p> elements and take the last one:

[x.text for x in soup.select('div.p3 p')][-1]

Option #3

Or is there a way to get the text of a <p> tag if the text in the <span> just before it is 'Height'?

Locate the <div> whose direct child <span> contains "Height", then select its direct <p> child:

soup.select_one('div:has(> span:-soup-contains("Height")) > p').text
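
If you would rather not use CSS selectors for the label lookup, something along these lines might also work. This is only a sketch: `value_for` is a helper name I made up, the variable `html` is assumed to hold the question's markup, and the exact string match only succeeds when the <strong> text has no surrounding whitespace (as in the question's compact markup).

```python
from bs4 import BeautifulSoup

# The markup from the question; the variable name `html` is an assumption.
html = '''
<div class="p3">
<div><span class="poptip"><strong>BP</strong></span><p>110</p></div>
<div><span class="poptip"><strong>Weight</strong></span><p>55</p></div>
<div><span class="poptip"><strong>Age</strong></span><p>28</p></div>
<div><span class="poptip"><strong>Height</strong></span><p>155</p></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')


def value_for(soup, label):
    """Hypothetical helper: text of the first <p> after the <strong> whose text equals label."""
    strong = soup.find('strong', string=label)
    if strong is None:
        return None  # no such label on the page
    p = strong.find_next('p')
    return p.get_text(strip=True) if p is not None else None


height = value_for(soup, 'Height')
print(height)
```

The same helper can be called with 'BP', 'Weight' or 'Age', and it returns None for a label that is not present instead of raising an error.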