I am trying to build a web scraper to collect data for a university research project. However, I cannot scrape the whole website, as there seems to be a problem with soup.find_all
...
This is what I've come up with so far:
from bs4 import BeautifulSoup
import requests
from csv import writer
url = 'https://pubmed.ncbi.nlm.nih.gov/?term=("spontaneous intracranial hypotension"[All Fields] OR "spontaneous cerebrospinal fluid leak"[All Fields] OR "cerebrospinal fluid hypovolemia"[All Fields] OR "cerebrospinal fluid hypovolemia syndrome"[All Fields] OR "Hypoliquorrhea"[All Fields] OR "Spontaneous spinal cerebrospinal fluid leak"[All Fields]) NOT "letter to the editor"[All Fields]&filter=dates.1000/1/1-2022/3/31&filter=lang.english&ac=no&format=abstract&sort=date&size=200'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all('article', class_="article-overview")
with open('disstest.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Herkunftsland', 'Journal', 'Anzahl Zitationen']
    thewriter.writerow(header)
    for list in lists:
        herkunftsland = lists.find('ul', class_="item-list").text.replace('\n','')
        journal = lists.find('div', class_="article-source").text.replace('\n', '')
        zitationen = lists.find('li', class_="references-count").text.replace('\n', '')
        info = [herkunftsland, journal, zitationen]
        thewriter.writerow(info)
I am getting the following messages:
Traceback (most recent call last):
  File "/Users/***/Documents/Test/scrape.py", line 17, in <module>
    herkunftsland = lists.find('ul', class_="item-list").text.replace('\n','')
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/bs4/element.py", line 2289, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'.
You're probably treating a list of elements like a single element.
Did you call find_all() when you meant to call find()?
CodePudding user response:
It looks like you are calling find() on lists, the whole result set returned by find_all(), inside the loop. You should call it on the loop variable instead, named _list here:
for _list in lists:
    herkunftsland = _list.find('ul', class_="item-list").text.replace('\n', '')
    journal = _list.find('div', class_="article-source").text.replace('\n', '')
    zitationen = _list.find('li', class_="references-count").text.replace('\n', '')
    info = [herkunftsland, journal, zitationen]
    thewriter.writerow(info)
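The same mistake can be reproduced without BeautifulSoup, since the object returned by find_all() behaves like a Python list. The following is only an analogy under that assumption: a plain list of strings stands in for the result set, and strip() stands in for find(). The variable names are made up for illustration.

```python
# Plain-Python analogy: a list of strings stands in for the ResultSet.
titles = ["  Journal A ", " Journal B  "]

try:
    # Calling an element method on the whole collection fails,
    # just like lists.find(...) does in the question.
    titles.strip()
except AttributeError as exc:
    error_message = str(exc)

# Calling the method on each item inside the loop works.
cleaned = [title.strip() for title in titles]
```

The fix in the scraper follows the same pattern: the method belongs to each item of the collection, not to the collection itself.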
CodePudding user response:
As mentioned by @Charls Ken, you used the wrong variable, lists, to extract your data. You should also avoid shadowing built-in names like list.
I would also recommend checking that an element was found before calling methods on it, to avoid AttributeErrors.
for _list in lists:
    herkunftsland = e.text.replace('\n','') if (e:= _list.find('ul', class_="item-list")) else None
    journal = e.text.replace('\n','').strip() if (e:= _list.find('div', class_="article-source")) else None
    zitationen = e.text.replace('\n','').strip() if (e:= _list.find('li', class_="references-count")) else None
    info = [herkunftsland, journal, zitationen]
Note: This uses the walrus operator (:=), which requires Python 3.8 or later.
Without the walrus operator, the lookup has to be written twice, for example:
journal = _list.find('div', class_="article-source").text.replace('\n','').strip() if _list.find('div', class_="article-source") else None
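Another way to avoid repeating the lookup, sketched here as one possible approach rather than a definitive one, is to move the check into a small helper. The names text_or_none and extract_row are hypothetical and not part of BeautifulSoup; the helper only assumes that its first argument offers a find() method like the one on a BeautifulSoup tag.

```python
def text_or_none(parent, name, css_class):
    """Return the cleaned text of the first matching child, or None.

    Hypothetical helper: `parent` is assumed to be a BeautifulSoup Tag
    (or any object with a compatible .find() method).
    """
    element = parent.find(name, class_=css_class)
    if element is None:
        # Nothing matched, so there is no .text to read.
        return None
    return element.text.replace('\n', '').strip()


def extract_row(article):
    """Build one CSV row from a single <article> element."""
    return [
        text_or_none(article, 'ul', 'item-list'),
        text_or_none(article, 'div', 'article-source'),
        text_or_none(article, 'li', 'references-count'),
    ]
```

Inside the loop this could be used as thewriter.writerow(extract_row(_list)). Note that csv.writer writes None as an empty field.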