How to Extract Multiple H2 Tags Using BeautifulSoup

Time:10-09

import requests
from bs4 import BeautifulSoup
import pandas as pd

articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'

r = requests.get(url)
#print(r.status_code)

soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_ = 'post-body__container')
#print(articles)

for item in articles:
  #h2_headings = item.find('h2').text
  h2_headings = item.find_all('h2')

  article = {
    'H2_Heading': h2_headings,
  }

  print('Added article:', article)
  articlelist.append(article)

df = pd.DataFrame(articlelist)
#df.to_csv('articlelist.csv', index=False)
#print('Saved to csv')

The webpage used within the script has multiple H2 Heading tags that I want to scrape.

I'm looking for a way to simply scrape all the H2 Heading text as shown below:

ANGRY BIRDS 2, ANGRY BIRDS DREAM BLAST, ANGRY BIRDS FRIENDS, ANGRY BIRDS MATCH, ANGRY BIRDS BLAST, ANGRY BIRDS POP

Issue

When I use the syntax h2_headings = item.find('h2').text, it extracts the text of the first h2 heading, as expected.

However, I need to capture all instances of the H2 tag. When I use h2_headings = item.find_all('h2'), it returns the following result:

{'H2_Heading': [<h2>Angry Birds 2</h2>, <h2>Angry Birds Dream Blast</h2>, <h2>Angry Birds Friends</h2>, <h2>Angry Birds Match</h2>, <h2>Angry Birds Blast</h2>, <h2>Angry Birds POP</h2>]}

Amending the statement to h2_headings = item.find_all('h2').text.strip() returns the following error:

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
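
The error follows from what find_all() returns: a ResultSet, which behaves like a list of Tag objects, so the text has to be read from each element rather than from the collection as a whole. A minimal sketch of that idea, assuming a small inline HTML string (a made-up stand-in for the real page, so it runs offline) and the built-in html.parser instead of lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the example runs offline.
html = """
<div class="post-body__container">
  <h2>Angry Birds 2</h2>
  <p>Some text</p>
  <h2> Angry Birds Friends </h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="post-body__container")

# find_all() returns a ResultSet (a list subclass), not a single Tag.
h2_tags = container.find_all("h2")

# Read the text from each Tag individually; strip=True trims whitespace.
h2_headings = [tag.get_text(strip=True) for tag in h2_tags]
print(h2_headings)
```

The list comprehension is the piece that replaces item.find_all('h2').text.strip() in the original loop.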

Any help would be greatly appreciated.

CodePudding user response:

See the answer to "How to remove h2 tag from html data using beautifulsoup4?"

I hope it helps. Loop over the result of find_all() and append each heading's text:

for item in articles:
  #h2_headings = item.find('h2').text
  h2_headings = item.find_all('h2')

  for h in h2_headings:
    articlelist.append(h.string)
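
One caveat with .string: it is None whenever a tag has more than one child node, for example when the heading contains nested markup. The sketch below compares .string with get_text() on a made-up snippet (the nested <em> tag is an assumption for illustration and is not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading contains a nested <em> tag.
html = "<div><h2>Angry Birds 2</h2><h2>Angry Birds <em>POP</em></h2></div>"
soup = BeautifulSoup(html, "html.parser")

headings = soup.find_all("h2")

# .string is None when a tag has more than one child node.
via_string = [h.string for h in headings]

# get_text() concatenates all nested text, so nothing is lost.
via_get_text = [h.get_text().strip() for h in headings]

print(via_string)
print(via_get_text)
```

For plain headings like the ones in the question, both approaches give the same text; get_text() is simply the safer default.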

CodePudding user response:

You can do that as follows:

import requests
from bs4 import BeautifulSoup
import pandas as pd

articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'

r = requests.get(url)
#print(r.status_code)

soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_ = 'post-body__container')


for item in articles:
    # Join the text of every <h2> in this container into one comma-separated string
    h2 = ', '.join(x.get_text() for x in item.find_all('h2'))
    print(h2)

#   print('Added article:', article)
#   articlelist.append(article)

# df = pd.DataFrame(articlelist)
#df.to_csv('articlelist.csv', index=False)
#print('Saved to csv')

Output:

Angry Birds 2, Angry Birds Dream Blast, Angry Birds Friends, Angry Birds Match, Angry Birds Blast, Angry Birds POP
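
To get the upper-case form shown in the question and keep the result in a DataFrame, the joined string can be stored in the article dictionary instead of being printed. This is only a sketch: the inline HTML is a shortened, made-up stand-in for the page, html.parser replaces lxml, and the call to .upper() is an assumption based on the desired output in the question.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, shortened stand-in for the real page so the example runs offline.
html = """
<div class="post-body__container">
  <h2>Angry Birds 2</h2>
  <h2>Angry Birds Dream Blast</h2>
  <h2>Angry Birds POP</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
articles = soup.find_all("div", class_="post-body__container")

articlelist = []
for item in articles:
    # Collect the text of every <h2> inside this container.
    headings = [h.get_text().strip() for h in item.find_all("h2")]
    # One row per container: headings joined and upper-cased.
    articlelist.append({"H2_Heading": ", ".join(headings).upper()})

df = pd.DataFrame(articlelist)
print(df)
# df.to_csv("articlelist.csv", index=False)  # uncomment to write the file
```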