I need to get the URL for each country from the "a href=" inside each "div" of class "well span4". For example, I need https://www.rulac.org/browse/countries/myanmar, https://www.rulac.org/browse/countries/the-netherlands, and every other URL that follows "a href=" (as shown in the partial HTML structure below).
Since the "a href=" is not under any class, how do I search for and collect all the country URLs?
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all("div", class_="well span4")
# Partial html structure shown as below
[<div >
<a href="https://www.rulac.org/browse/countries/myanmar">
<div >
<img alt="Myanmar" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=19.7633057,96.07851040000003&format=png&style=feature:administrative.locality|element:all|visibility:off&style=feature:water|element:all|hue:0xEDF9FF|lightness:80|saturation:9&style=feature:road|element:all|visibility:off&style=feature:landscape|element:all|hue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Myanmar"/>
<img src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Myanmar</h2>
<a href="https://www.rulac.org/browse/countries/myanmar">Read on <i ></i></a>
</div>,
<div >
<a href="https://www.rulac.org/browse/countries/the-netherlands">
<div >
<img alt="Netherlands" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=52.203566364441,5.7275408506393&format=png&style=feature:administrative.locality|element:all|visibility:off&style=feature:water|element:all|hue:0xEDF9FF|lightness:80|saturation:9&style=feature:road|element:all|visibility:off&style=feature:landscape|element:all|hue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Netherlands"/>
<img src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Netherlands</h2>
<a href="https://www.rulac.org/browse/countries/the-netherlands">Read on <i ></i></a>
</div>,
<div >
<a href="https://www.rulac.org/browse/countries/niger">
<div >
<img alt="Niger" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=13.5115963,2.1253854000000274&format=png&style=feature:administrative.locality|element:all|visibility:off&style=feature:water|element:all|hue:0xEDF9FF|lightness:80|saturation:9&style=feature:road|element:all|visibility:off&style=feature:landscape|element:all|hue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Niger"/>
<img src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Niger</h2>
<a href="https://www.rulac.org/browse/countries/niger">Read on <i ></i></a>
</div>,
CodePudding user response:
You can use soup.select() with a CSS selector to get all <a> elements of class btn that are direct children of <div>s with both classes well and span4. Like this:
import requests
from bs4 import BeautifulSoup
url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.select("div.well.span4 > a.btn")
# get all hrefs in a list and print it
hrefs = [el['href'] for el in res]
for href in hrefs:
    print(href)
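If you would rather not depend on the btn class (it does not appear in the HTML excerpt in the question, so treat it as an assumption about the live page), here is a minimal sketch of an alternative: walk each div.well.span4 and take the first <a> that has an href. The SAMPLE_HTML string and the country_links helper are hypothetical names made up for this illustration, and the class attributes in the sample are inferred from the find_all() call in the question. It parses an inline string so it runs without a network request; with the real page you would pass resp.content instead.

```python
from bs4 import BeautifulSoup

# Trimmed sample modeled on the markup in the question. The class
# attributes are an assumption based on the question's find_all() call.
SAMPLE_HTML = """
<div class="well span4">
  <a href="https://www.rulac.org/browse/countries/myanmar"><div><img alt="Myanmar" src="x.png"/></div></a>
  <h2>Myanmar</h2>
  <a href="https://www.rulac.org/browse/countries/myanmar">Read on <i></i></a>
</div>
<div class="well span4">
  <a href="https://www.rulac.org/browse/countries/the-netherlands"><div><img alt="Netherlands" src="x.png"/></div></a>
  <h2>Netherlands</h2>
  <a href="https://www.rulac.org/browse/countries/the-netherlands">Read on <i></i></a>
</div>
"""


def country_links(html):
    """Return one URL per 'well span4' card, in page order, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for card in soup.select("div.well.span4"):
        # First <a> in the card that actually carries an href attribute.
        anchor = card.find("a", href=True)
        if anchor is not None and anchor["href"] not in links:
            links.append(anchor["href"])
    return links


links = country_links(SAMPLE_HTML)
for link in links:
    print(link)
```

Each card contains the same URL twice (the map image link and the "Read on" link), so taking only the first anchor per card avoids duplicates without needing to know any class on the <a> itself.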