I have collected a list of links from a folder of documents that are essentially Wikipedia pages. I eventually realized that my list of links is incomplete, because my code only collects a few of the links from each Wikipedia page. My goal is to get all links and then filter them afterwards; I should end up with a list of links to train-related accidents. The keywords for such accidents in the links vary (disaster, tragedy, etc.) and I don't know them beforehand.
My input is

list_of_urls = []
for file in files:
    text = open('files_overview/' + file, encoding="utf-8").read()
    soup = BeautifulSoup(text, features="lxml")
    for item in soup.findAll("div", attrs={'class': 'mw-content-ltr'}):
        url = item.find('a')  # this only finds the first <a> in each div
        urls = url.get("href")
        urls1 = "https://en.wikipedia.org" + urls
        list_of_urls.append(urls1)
HTML code from one of my documents, in which multiple links lie, is given below:
</div><div class="mw-category-generated" lang="en" dir="ltr"><div id="mw-pages">
<h2><span class="anchor" id="Pages_in_category"></span>Pages in category "Railway accidents in 1860"</h2>
<p>The following 3 pages are in this category, out of 3 total. This list may not reflect recent changes (<a href="/wiki/Wikipedia:FAQ/Categorization#Why_might_a_category_list_not_be_up_to_date?" title="Wikipedia:FAQ/Categorization">learn more</a>).
</p><div lang="en" dir="ltr" class="mw-content-ltr"><h3>A</h3>
<ul><li><a href="/wiki/Atherstone_rail_accident" title="Atherstone rail accident">Atherstone rail accident</a></li></ul><h3>B</h3>
<ul><li><a href="/wiki/Bull_bridge_accident" title="Bull bridge accident">Bull bridge accident</a></li></ul><h3>H</h3>
<ul><li><span class="redirect-in-category"><a href="/wiki/Helmshore_rail_accident" class="mw-redirect" title="Helmshore rail accident">Helmshore rail accident</a></span></li></ul></div>
</div></div><noscript><img src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript>
<div class="printfooter">Retrieved from "<a dir="ltr" href="https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968">https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968</a>"</div></div>
<div id="catlinks" class="catlinks" data-mw="interface"><div id="mw-normal-catlinks"
From the above, I manage to get Atherstone_rail_accident, but not Bull_bridge nor Helmshore. Does anyone have a better approach?
Thank you for your time
CodePudding user response:
What happens?
You iterate over the result set of soup.findAll("div", attrs={'class':'mw-content-ltr'}), but item.find('a') returns only the first matching <a> in each <div>; that's why you only get the first link.
Example
list_of_urls = []
for file in files:
    text = open('files_overview/' + file, encoding="utf-8").read()
    soup = BeautifulSoup(text, features="lxml")
    for a in soup.select('div.mw-content-ltr a'):
        list_of_urls.append(f'https://en.wikipedia.org{a["href"]}')
How to fix?
Instead of selecting the <div>, select all the links inside your <div> and iterate over them:

for a in soup.select('div.mw-content-ltr a'):
    list_of_urls.append(f'https://en.wikipedia.org{a["href"]}')
Output
['https://en.wikipedia.org/wiki/Atherstone_rail_accident',
'https://en.wikipedia.org/wiki/Bull_bridge_accident',
'https://en.wikipedia.org/wiki/Helmshore_rail_accident']
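Once you have the full list, you still need the filtering step from the question. A minimal sketch of a keyword filter follows; note the keyword tuple here is only an assumption (the question says the exact wording varies and is not known beforehand), so you would extend it as you discover new terms:

```python
# Assumed keyword list -- the real set of words is not known beforehand.
KEYWORDS = ("accident", "disaster", "tragedy", "crash", "collision", "derailment")

def is_accident_link(url):
    """Return True if the URL mentions any of the assumed keywords."""
    return any(kw in url.lower() for kw in KEYWORDS)

list_of_urls = [
    'https://en.wikipedia.org/wiki/Atherstone_rail_accident',
    'https://en.wikipedia.org/wiki/Bull_bridge_accident',
    'https://en.wikipedia.org/wiki/Main_Page',
]
accident_urls = [url for url in list_of_urls if is_accident_link(url)]
```

Here accident_urls keeps the two rail-accident links and drops /wiki/Main_Page.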
EDIT
If you would rather add the prefix https://en.wikipedia.org later in the process, just skip that step while appending the href to your list:

for a in soup.select('div.mw-content-ltr a'):
    list_of_urls.append(a["href"])
If you like to request the urls in a second step you can do it like this:
for url in list_of_urls:
    response = requests.get(f'https://en.wikipedia.org{url}')
Or if you just need a list with full URLs, you can build it with a list comprehension:

list_of_urls = [f'https://en.wikipedia.org{url}' for url in list_of_urls]
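As an alternative to string concatenation (not part of the answer above, just a suggestion), urllib.parse.urljoin from the standard library builds absolute URLs and also copes with hrefs that are already absolute or protocol-relative, like the //en.wikipedia.org image source in the sample HTML:

```python
from urllib.parse import urljoin

base = 'https://en.wikipedia.org'
hrefs = [
    '/wiki/Atherstone_rail_accident',                      # relative href
    'https://en.wikipedia.org/wiki/Bull_bridge_accident',  # already absolute
    '//en.wikipedia.org/wiki/Helmshore_rail_accident',     # protocol-relative
]
# urljoin leaves absolute URLs alone and completes the others against base.
full_urls = [urljoin(base, href) for href in hrefs]
```

All three entries come out as complete https://en.wikipedia.org/wiki/... URLs.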
CodePudding user response:
You can do it like this.

- First find all the <div> with class name mw-content-ltr using .find_all().
- For each <div> obtained above, find all the <a> tags using .find_all(). This will give you a list of <a> tags for each <div>.
- Iterate over it and extract the href from the above list of <a> tags.
Here is the code.
from bs4 import BeautifulSoup
s = """
<div class="mw-category-generated" lang="en" dir="ltr">
<div id="mw-pages">
<h2><span id="Pages_in_category"></span>Pages in category "Railway accidents in 1860"</h2>
<p>The following 3 pages are in this category, out of 3 total. This list may not reflect recent changes (<a href="/wiki/Wikipedia:FAQ/Categorization#Why_might_a_category_list_not_be_up_to_date?" title="Wikipedia:FAQ/Categorization">learn more</a>).</p>
<div lang="en" dir="ltr" class="mw-content-ltr">
<h3>A</h3>
<ul>
<li><a href="/wiki/Atherstone_rail_accident" title="Atherstone rail accident">Atherstone rail accident</a></li>
</ul>
<h3>B</h3>
<ul>
<li><a href="/wiki/Bull_bridge_accident" title="Bull bridge accident">Bull bridge accident</a></li>
</ul>
<h3>H</h3>
<ul>
<li><span class="redirect-in-category"><a href="/wiki/Helmshore_rail_accident" class="mw-redirect" title="Helmshore rail accident">Helmshore rail accident</a></span></li>
</ul>
</div>
</div>
</div>
<noscript><img src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript>
<div class="printfooter">Retrieved from "<a dir="ltr" href="https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968">https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968</a>"</div>
</div>
<div id="catlinks" class="catlinks" data-mw="interface">
"""
soup = BeautifulSoup(s, 'lxml')
divs = soup.find_all('div', class_='mw-content-ltr')
for div in divs:
    for a in div.find_all('a'):
        print(a['href'])
/wiki/Atherstone_rail_accident
/wiki/Bull_bridge_accident
/wiki/Helmshore_rail_accident
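One caveat with both answers: find_all('a') returns every link inside the div, so if a category page ever contains non-article links (e.g. the /wiki/Wikipedia:FAQ/Categorization link in the question's HTML), those would come along too. A minimal sketch for keeping only article links, assuming (as is the Wikipedia convention) that namespace pages carry a ":" prefix after /wiki/:

```python
hrefs = [
    '/wiki/Atherstone_rail_accident',
    '/wiki/Wikipedia:FAQ/Categorization',  # namespace page, not an article
    '/wiki/Bull_bridge_accident',
]

def is_article(href):
    """Keep only /wiki/ links whose title has no ':' namespace prefix."""
    if not href.startswith('/wiki/'):
        return False
    return ':' not in href.split('/wiki/', 1)[1]

articles = [h for h in hrefs if is_article(h)]
```

This drops the Wikipedia:FAQ link and keeps the two accident articles.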