bs4 - how to use find or find_all to get specific content from a URL

Time:09-21

I need to get the URL of each country from the "a href=" inside each "div" with the class "well span4". For example, I need to get https://www.rulac.org/browse/countries/myanmar, https://www.rulac.org/browse/countries/the-netherlands, and every other URL that follows "a href=" (as shown in the partial HTML structure below).

Since the "a href=" does not sit under any class of its own, how do I search for it and get all the country URLs?

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup


url = "https://www.rulac.org/browse/countries/P36"  
resp = requests.get(url)

soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all("div", class_="well span4")

# Partial HTML structure of res, shown below
[<div class="well span4">
 <a href="https://www.rulac.org/browse/countries/myanmar">
 <div >
 <img alt="Myanmar" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&amp;zoom=5&amp;center=19.7633057,96.07851040000003&amp;format=png&amp;style=feature:administrative.locality|element:all|visibility:off&amp;style=feature:water|element:all|hue:0xEDF9FF|lightness:80|saturation:9&amp;style=feature:road|element:all|visibility:off&amp;style=feature:landscape|element:all|hue:0xE0EADC&amp;key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Myanmar"/>
 <img  src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
 </div>
 </a>
 <h2>Myanmar</h2>
 <a  href="https://www.rulac.org/browse/countries/myanmar">Read on <i ></i></a>
 </div>,
 <div class="well span4">
 <a href="https://www.rulac.org/browse/countries/the-netherlands">
 <div >
 <img alt="Netherlands" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&amp;zoom=5&amp;center=52.203566364441,5.7275408506393&amp;format=png&amp;style=feature:administrative.locality|element:all|visibility:off&amp;style=feature:water|element:all|hue:0xEDF9FF|lightness:80|saturation:9&amp;style=feature:road|element:all|visibility:off&amp;style=feature:landscape|element:all|hue:0xE0EADC&amp;key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Netherlands"/>
 <img  src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
 </div>
 </a>
 <h2>Netherlands</h2>
 <a  href="https://www.rulac.org/browse/countries/the-netherlands">Read on <i ></i></a>
 </div>,
 <div class="well span4">
 <a href="https://www.rulac.org/browse/countries/niger">
 <div >
 <img alt="Niger" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&amp;zoom=5&amp;center=13.5115963,2.1253854000000274&amp;format=png&amp;style=feature:administrative.locality|element:all|visibility:off&amp;style=feature:water|element:all|hue:0xEDF9FF|lightness:80|saturation:9&amp;style=feature:road|element:all|visibility:off&amp;style=feature:landscape|element:all|hue:0xE0EADC&amp;key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Niger"/>
 <img  src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
 </div>
 </a>
 <h2>Niger</h2>
 <a  href="https://www.rulac.org/browse/countries/niger">Read on <i ></i></a>
 </div>,

CodePudding user response:

You can use soup.select() with a CSS selector to get all <a> elements with the class btn that are direct children of <div>s with the classes well and span4, like this:

import requests
from bs4 import BeautifulSoup


url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)

soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.select("div.well.span4 > a.btn")

# get all hrefs in a list and print it
hrefs = [el['href'] for el in res]
for href in hrefs:
    print(href)
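
Since the question asks about find and find_all specifically, here is a minimal sketch of the same extraction without CSS selectors. It loops over each "well span4" div and takes the first <a> inside it that has an href, so it does not depend on the link having any class. SAMPLE_HTML is a hypothetical, trimmed-down stand-in for the real page (the btn class in it is assumed from the selector above), and country_links is just an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample of the page markup. The "btn" class on the
# second link of each card is an assumption taken from the selector above.
SAMPLE_HTML = """
<div class="well span4">
  <a href="https://www.rulac.org/browse/countries/myanmar"><div><img alt="Myanmar"/></div></a>
  <h2>Myanmar</h2>
  <a class="btn" href="https://www.rulac.org/browse/countries/myanmar">Read on</a>
</div>
<div class="well span4">
  <a href="https://www.rulac.org/browse/countries/the-netherlands"><div><img alt="Netherlands"/></div></a>
  <h2>Netherlands</h2>
  <a class="btn" href="https://www.rulac.org/browse/countries/the-netherlands">Read on</a>
</div>
"""


def country_links(html):
    """Return one country URL per "well span4" card, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for card in soup.find_all("div", class_="well span4"):
        # find() returns the first matching descendant; href=True keeps only
        # <a> tags that actually carry an href attribute.
        anchor = card.find("a", href=True)
        if anchor is not None:
            links.append(anchor["href"])
    return links


if __name__ == "__main__":
    for link in country_links(SAMPLE_HTML):
        print(link)
```

To run it against the live page, pass resp.content from the request above instead of SAMPLE_HTML. Note that class_="well span4" matches the class attribute as an exact string, so it would miss a div whose classes appear in a different order; the select() version does not have that limitation.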