Using BeautifulSoup to collect urls from html code


I have collected a list of links from a folder of documents that are essentially Wikipedia pages. I eventually realized that my list of links is incomplete, because my code only collects a few of the links from each Wikipedia page. My goal is to get all links and then filter them afterwards. I should end up with a list of links to train-related accidents. Keywords for such accidents in the links vary between disaster, tragedy, etc.; I don't know them beforehand.

My input is

list_of_urls = []

for file in files:     
    text = open('files_overview/' + file, encoding="utf-8").read()
    soup = BeautifulSoup(text, features="lxml")
    for item in soup.findAll("div", attrs={'class':'mw-content-ltr'}):                 
        url = item.find('a', attrs={'class':'href'=="accident"})
        # If I don't add something, like "accident", it gives me a syntax error..
        urls = url.get("href")
        urls1 = "https://en.wikipedia.org" + urls
        list_of_urls.append(urls1)

HTML code from one of my documents, in which multiple links lie, is given below:

</div><div class="mw-category-generated" lang="en" dir="ltr"><div id="mw-pages">
<h2><span class="anchor" id="Pages_in_category"></span>Pages in category "Railway accidents in 1860"</h2>
<p>The following 3 pages are in this category, out of  3 total. This list may not reflect recent changes (<a href="/wiki/Wikipedia:FAQ/Categorization#Why_might_a_category_list_not_be_up_to_date?" title="Wikipedia:FAQ/Categorization">learn more</a>).
</p><div lang="en" dir="ltr" class="mw-content-ltr"><h3>A</h3>
<ul><li><a href="/wiki/Atherstone_rail_accident" title="Atherstone rail accident">Atherstone rail accident</a></li></ul><h3>B</h3>
<ul><li><a href="/wiki/Bull_bridge_accident" title="Bull bridge accident">Bull bridge accident</a></li></ul><h3>H</h3>
<ul><li><span class="redirect-in-category"><a href="/wiki/Helmshore_rail_accident" class="mw-redirect" title="Helmshore rail accident">Helmshore rail accident</a></span></li></ul></div>
</div></div><noscript><img src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript>
<div class="printfooter">Retrieved from "<a dir="ltr" href="https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&amp;oldid=895698968">https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&amp;oldid=895698968</a>"</div></div>
        <div id="catlinks" class="catlinks" data-mw="interface"><div id="mw-normal-catlinks" 

From the above, I manage to get Atherstone_rail_accident, but not Bull_bridge nor Helmshore. Does anyone have a better approach?

Thank you for your time

CodePudding user response:

What happens?

For each <div> in the result set of soup.findAll("div", attrs={'class':'mw-content-ltr'}) you call .find('a'), and .find() returns only the first match; that's why you only get the first link from each <div>.
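To see the difference in isolation, here is a minimal sketch on a snippet modelled after the category page above (using the built-in html.parser so it runs without lxml):

```python
from bs4 import BeautifulSoup

# Two links inside one div, as in the category listing
html = '<div class="mw-content-ltr"><a href="/wiki/A">A</a> <a href="/wiki/B">B</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# .find('a') stops at the first matching <a> inside the div
div = soup.find('div', attrs={'class': 'mw-content-ltr'})
print(div.find('a')['href'])  # /wiki/A

# .select() returns every <a> nested in such a div
print([a['href'] for a in soup.select('div.mw-content-ltr a')])  # ['/wiki/A', '/wiki/B']
```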

Example

list_of_urls = []
for file in files:     
    text = open('files_overview/' + file, encoding="utf-8").read()
    soup = BeautifulSoup(text, features="lxml")

    for a in soup.select('div.mw-content-ltr a'):
        list_of_urls.append(f'https://en.wikipedia.org{a["href"]}')

How to fix?

Instead of selecting the <div>, select all the links inside your <div> and iterate over them:

for a in soup.select('div.mw-content-ltr a'):
    list_of_urls.append(f'https://en.wikipedia.org{a["href"]}')

Output

['https://en.wikipedia.org/wiki/Atherstone_rail_accident',
 'https://en.wikipedia.org/wiki/Bull_bridge_accident',
 'https://en.wikipedia.org/wiki/Helmshore_rail_accident']

EDIT

If you would rather add the prefix https://en.wikipedia.org later in the process, just skip that step while appending the href to your list:

for a in soup.select('div.mw-content-ltr a'):
    list_of_urls.append(a["href"])

If you like to request the urls in a second step you can do it like this:

for url in list_of_urls:
    response = requests.get(f'https://en.wikipedia.org{url}')

Or if you just need a list of full URLs, you can build it with a list comprehension:

list_of_urls = [f'https://en.wikipedia.org{url}' for url in list_of_urls]
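Since the end goal stated in the question is to keep only accident-related pages, a follow-up filter could look like the sketch below. The keyword list here is an assumption for illustration; the question notes the actual terms vary and aren't known beforehand.

```python
# Hypothetical keywords -- extend as you discover more terms in the data
keywords = ['accident', 'disaster', 'tragedy', 'crash', 'derailment']

list_of_urls = ['https://en.wikipedia.org/wiki/Atherstone_rail_accident',
                'https://en.wikipedia.org/wiki/Bull_bridge_accident',
                'https://en.wikipedia.org/wiki/Helmshore_rail_accident']

# Keep only urls whose path mentions any keyword (case-insensitive)
accident_urls = [u for u in list_of_urls if any(k in u.lower() for k in keywords)]
print(accident_urls)
```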

CodePudding user response:

You can do it like this.

  • First find all the <div> with class name as mw-content-ltr using .find_all()
  • For each <div> obtained above, find all the <a> tags using .find_all(). This will give you a list of <a> for each <div>.
  • Iterate over the above list of <a> tags and extract the href from each.

Here is the code.

from bs4 import BeautifulSoup

s = """
<div class="mw-category-generated" lang="en" dir="ltr">
   <div id="mw-pages">
      <h2><span class="anchor" id="Pages_in_category"></span>Pages in category "Railway accidents in 1860"</h2>
      <p>The following 3 pages are in this category, out of  3 total. This list may not reflect recent changes (<a href="/wiki/Wikipedia:FAQ/Categorization#Why_might_a_category_list_not_be_up_to_date?" title="Wikipedia:FAQ/Categorization">learn more</a>).</p>
      <div lang="en" dir="ltr" class="mw-content-ltr">
         <h3>A</h3>
         <ul>
            <li><a href="/wiki/Atherstone_rail_accident" title="Atherstone rail accident">Atherstone rail accident</a></li>
         </ul>
         <h3>B</h3>
         <ul>
            <li><a href="/wiki/Bull_bridge_accident" title="Bull bridge accident">Bull bridge accident</a></li>
         </ul>
         <h3>H</h3>
         <ul>
            <li><span class="redirect-in-category"><a href="/wiki/Helmshore_rail_accident" class="mw-redirect" title="Helmshore rail accident">Helmshore rail accident</a></span></li>
         </ul>
      </div>
   </div>
</div>
<noscript><img src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript>
<div class="printfooter">Retrieved from "<a dir="ltr" href="https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&amp;oldid=895698968">https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&amp;oldid=895698968</a>"</div>
</div>
<div id="catlinks" class="catlinks" data-mw="interface">
"""
soup = BeautifulSoup(s, 'lxml')

divs = soup.find_all('div', class_='mw-content-ltr')

for div in divs:
    for a in div.find_all('a'):
        print(a['href'])
Output

/wiki/Atherstone_rail_accident
/wiki/Bull_bridge_accident
/wiki/Helmshore_rail_accident