import requests
from bs4 import BeautifulSoup
import pandas as pd
articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'
r = requests.get(url)
#print(r.status_code)
soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_ = 'post-body__container')
#print(articles)
for item in articles:
    #h2_headings = item.find('h2').text
    h2_headings = item.find_all('h2')
    article = {
        'H2_Heading': h2_headings,
    }
    print('Added article:', article)
    articlelist.append(article)
df = pd.DataFrame(articlelist)
#df.to_csv('articlelist.csv', index=False)
#print('Saved to csv')
The webpage used within the script has multiple H2 Heading tags that I want to scrape.
I'm looking for a way to simply scrape all the H2 Heading text as shown below:
ANGRY BIRDS 2, ANGRY BIRDS DREAM BLAST, ANGRY BIRDS FRIENDS, ANGRY BIRDS MATCH, ANGRY BIRDS BLAST, ANGRY BIRDS POP
Issue
When I use the syntax h2_headings = item.find('h2').text
it extracts the text of the first h2 heading, as expected.
However, I need to capture all instances of the H2 tag. When I use h2_headings = item.find_all('h2')
it returns the following result:
{'H2_Heading': [<h2>Angry Birds 2</h2>, <h2>Angry Birds Dream Blast</h2>, <h2>Angry Birds Friends</h2>, <h2>Angry Birds Match</h2>, <h2>Angry Birds Blast</h2>, <h2>Angry Birds POP</h2>]}
Amending the statement to h2_headings = item.find_all('h2').text.strip()
returns the following error:
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Any help would be greatly appreciated.
CodePudding user response:
See this answer: "How to remove h2 tag from html data using beautifulsoup4?"
find_all() returns a list of tags, so loop over it and take the text of each tag. I hope this helps.
for item in articles:
    #h2_headings = item.find('h2').text
    h2_headings = item.find_all('h2')
    for h in h2_headings:
        articlelist.append(h.string)
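One caveat with h.string: it is None when a heading contains nested markup, whereas get_text() concatenates the text of all descendants. A minimal sketch of the difference, assuming made-up HTML (sample_html is hypothetical and not taken from the real page; html.parser is used only so the snippet runs without lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only, not copied from the real page.
sample_html = "<div><h2>Angry Birds 2</h2><h2>Angry <em>Birds</em> POP</h2></div>"
soup = BeautifulSoup(sample_html, "html.parser")

headings = soup.find_all("h2")

# .string is None when a tag has more than one child node,
# as in the second heading (text, <em>, text).
strings = [h.string for h in headings]

# get_text() joins the text of every descendant; strip() trims the ends.
texts = [h.get_text().strip() for h in headings]

print(strings)
print(texts)
```

If the headings on the page are plain text, both approaches give the same result; get_text() is simply the safer default.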
CodePudding user response:
You can do that as follows:
import requests
from bs4 import BeautifulSoup
import pandas as pd
articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'
r = requests.get(url)
#print(r.status_code)
soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_ = 'post-body__container')
for item in articles:
    h2 = ', '.join([x.get_text() for x in item.find_all('h2')])
    print(h2)
# print('Added article:', article)
# articlelist.append(article)
# df = pd.DataFrame(articlelist)
#df.to_csv('articlelist.csv', index=False)
#print('Saved to csv')
Output:
Angry Birds 2, Angry Birds Dream Blast, Angry Birds Friends, Angry Birds Match, Angry Birds Blast, Angry Birds POP
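If the headings should end up in the DataFrame from the question rather than only being printed, one option is to store one row per heading. This is a sketch under stated assumptions: sample_html is a hypothetical stand-in for the downloaded page (so the snippet runs offline), html.parser replaces lxml, and the upper() call is only a guess at how to get the upper-case form shown in the question.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for r.content; the real page is not fetched here.
sample_html = """
<div class="post-body__container">
  <h2>Angry Birds 2</h2>
  <p>Some text</p>
  <h2>Angry Birds Friends</h2>
  <h2> Angry Birds POP </h2>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

articlelist = []
for item in soup.find_all("div", class_="post-body__container"):
    for h2 in item.find_all("h2"):
        # One dict per heading gives one DataFrame row per heading.
        articlelist.append({"H2_Heading": h2.get_text().strip()})

df = pd.DataFrame(articlelist)

# Single comma-separated, upper-case string, as in the desired output.
joined = ", ".join(df["H2_Heading"]).upper()
print(joined)
```

From here, df.to_csv('articlelist.csv', index=False) from the original script would write one heading per line.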