I have the following soup:
next ...
From this I want to extract the href, "some_url", and the whole list of pages that are linked on this page:
https://www.catholic-hierarchy.org/diocese/laa.html
Note: there are a whole lot of links to sub-pages, which I need to parse. At the moment I am trying to get all the data out of them: dioceses, URLs, description, contact data, etc.
The following is one way of getting that information in an async fashion (it should work in Colab notebooks). I got the diocese URLs from a different part of the site (Structured View - World Regions). I would expect the diocese count there to match the count from the letters list.
from httpx import Client, AsyncClient, Limits
from bs4 import BeautifulSoup as bs
import pandas as pd
import re
from datetime import datetime
import asyncio
import nest_asyncio

# Allow asyncio.run() inside a notebook, where an event loop is already running.
nest_asyncio.apply()

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

# Collects one tuple per diocese: (name, bishops, info, url).
big_df_list = []

def all_dioceses():
    """Return the URLs of all diocese pages listed in the seven region views."""
    dioceses = []
    root_links = [f'https://www.catholic-hierarchy.org/diocese/qview{x}.html' for x in range(1, 8)]
    with Client(headers=headers, timeout=60.0, follow_redirects=True) as client:
        for x in root_links:
            r = client.get(x)
            soup = bs(r.text)
            # Drop the navigation menu so that its links are not collected.
            soup.select_one('ul#menu2').decompose()
            for link in soup.select('ul > li > a'):
                dioceses.append('https://www.catholic-hierarchy.org/diocese/' + link.get('href'))
    return dioceses

# print(all_dioceses())

async def get_diocese_info(url):
    """Fetch one diocese page and append its details to big_df_list."""
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        try:
            r = await client.get(url)
            soup = bs(r.text)
            d_name = soup.select_one('h1[align="center"]').get_text(strip=True)
            info_table = soup.select_one('div[id="d1"] > table')
            d_bishops = ' | '.join([x.get_text(strip=True) for x in info_table.select('td')[0].select('li')])
            d_extra_info = ' | '.join([x.get_text(strip=True) for x in info_table.select('td')[1].select('li')])
            big_df_list.append((d_name, d_bishops, d_extra_info, url))
            print('done', d_name)
        except Exception as e:
            print(url, e)

async def scrape_dioceses():
    start_time = datetime.now()
    tasks = asyncio.Queue()
    for x in all_dioceses():
        tasks.put_nowait(get_diocese_info(x))

    async def worker():
        # Each worker keeps awaiting queued coroutines until the queue is empty.
        while not tasks.empty():
            await tasks.get_nowait()

    # Run up to 100 workers concurrently.
    await asyncio.gather(*[worker() for _ in range(100)])
    end_time = datetime.now()
    duration = end_time - start_time
    print('diocese scraping took', duration)

asyncio.run(scrape_dioceses())
df = pd.DataFrame(big_df_list, columns = ['Name', 'Bishops', 'Info', 'Url'])
print(df)
This should lead to the following results:
done Eparchy of Mississauga (Syro-Malabar)
done Eparchy of Mar Addai of Toronto (Chaldean)
done Eparchy of Saint-Sauveur de Montréal (Melkite Greek)
done Diocese of Calgary
done Archdiocese of Winnipeg
[...]
diocese scraping took 0:03:02.366096
     Name    Bishops    Info    Url
0    Eparchy of Mississauga (Syro-Malabar)    JoseKalluvelil, Bishop    Type of Jurisdiction: Eparchy | Elevated:22 December2018 | Immediately Subject to the Holy See | Syro-Malabar Catholic Church of the Chaldean Tradition | Country:Canada | Mailing Address: Syro-Malabar Apostolic Exarchate, 6630 Turner Valley Rd., Mississauga, ON L5V 2P1, Canada | Telephone: (905)858-8200 | Fax: 858-8208    https://www.catholic-hierarchy.org/diocese/dmism.html
1    Eparchy of Mar Addai of Toronto (Chaldean)    Robert SaeedJarjis, Bishop | Bawai (Ashur)Soro, Bishop Emeritus    Type of Jurisdiction: Eparchy | Erected:10 June2011 | Immediately Subject to the Holy See | Chaldean Catholic Church of the Chaldean Tradition | Country:Canada | Conference Region:Ontario | Mailing Address: 2 High Meadow Place, Toronto, ON M9L 2Z5, Canada | Telephone: (416)746-5816 | Fax: 746-5850    https://www.catholic-hierarchy.org/diocese/dtoch.html
2    Eparchy of Saint-Sauveur de Montréal (Melkite Greek)    MiladJawish, B.S., Bishop    Type of Jurisdiction: Eparchy | Elevated:1 September1984 | Immediately Subject to the Holy See | Melkite Greek Catholic Church of the Byzantine Tradition | Country:Canada | Conference Region:Quebec | Web Site:http://www.melkite.com/ | Mailing Address: 10025 boul. de l'Arcadie, Montreal, QC H4N 2S1, Canada | Telephone: (514)272.6430 | Fax: 202.1274    https://www.catholic-hierarchy.org/diocese/dmome.html
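The queue-and-workers pattern from scrape_dioceses can be tried out without any network access. The following is only a minimal sketch under stated assumptions: fake_fetch, run_all and the example.org URLs are hypothetical names introduced for illustration, and the queue holds plain URLs instead of coroutine objects, which is a small change from the script above.

```python
import asyncio

async def fake_fetch(url):
    """Hypothetical stand-in for an HTTP request; it never touches the network."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"

async def run_all(urls, n_workers=5):
    """Process every URL with at most n_workers fetches in flight at once."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = []

    async def worker():
        # No await sits between empty() and get_nowait(), so the check is safe.
        while not queue.empty():
            url = queue.get_nowait()
            results.append(await fake_fetch(url))

    await asyncio.gather(*(worker() for _ in range(n_workers)))
    return results

if __name__ == "__main__":
    demo_urls = [f"https://example.org/page{i}.html" for i in range(12)]
    print(len(asyncio.run(run_all(demo_urls))), "pages processed")
```

Results arrive in completion order, not in input order. In a notebook cell, IPython supports top-level await, so writing `await run_all(demo_urls)` directly is an alternative to asyncio.run.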
Note: I was not able to run this on Colab. How can I simplify it so that the code runs there?
I get errors. This is what I get back when running it in Colab:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-64bb145c85bf> in <module>
----> 1 from httpx import Client, AsyncClient, Limits
2 from bs4 import BeautifulSoup as bs
3 import pandas as pd
4 import re
5 from datetime import datetime
ModuleNotFoundError: No module named 'httpx'
---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.
To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------
Mauro Martins suggested running the command below, but I am not a Pro user on Colab. So the question is: how can I simplify this so that I can run it on an ordinary Colab account?
!pip install httpx nest_asyncio
Try running this code before your script.
Many thanks for the quick reply. Awesome. I understand your approach, but I thought I would need a Pro account on Colab, which I do not have. So the question is: can I simplify the script so that it runs on a general Colab account without any issues?
Many thanks, dear Mauro Martins Junior. This code helped:
!pip install httpx nest_asyncio
Update: thanks to Mauro Martins I have learned how to install packages in Colab:
How do I install Python packages in Google's Colab?
In a project I have, for example, two different packages. How can I use setup.py to install these two packages in Google's Colab so that I can import them?
See the answer:
You can use !python setup.py install to do that. Colab is just like a Jupyter notebook, so the ! operator can be used to install any package. The ! tells the notebook cell that the line is not Python code but a command-line command. So, to run any command-line command in Colab, just put a ! in front of the line. For example: !pip install tensorflow. This treats the line (here pip install tensorflow) as a command-prompt line and not as Python code. If you leave out the leading !, Python raises an "invalid syntax" error. Keep in mind that you have to upload the setup.py file to your Drive before doing this (preferably into the same folder as your notebook).
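To illustrate what the ! prefix does, here is a rough plain-Python counterpart that runs a command in a subprocess and captures its output. This is only a sketch: run_command is a hypothetical helper name, and the notebook's ! runs the line through a shell, which this example does not do.

```python
import subprocess
import sys

def run_command(args):
    """Run a command given as a list of arguments and return its stdout as text."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout

if __name__ == "__main__":
    # Similar in spirit to typing `!pip --version` in a notebook cell.
    print(run_command([sys.executable, "-m", "pip", "--version"]))
```

Calling pip through sys.executable targets the same interpreter that is running the notebook or script.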
CodePudding user response:
It seems that the problem is that some libraries are missing. I tried running it here on Colab, and after running the command below it worked fine.
!pip install httpx nest_asyncio
Try running this code before your script.
Note that the exclamation point is part of the command.
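To find out which imports are missing before running the whole script, a small check like the following may help. It is a sketch that uses only the standard library. missing_modules is a hypothetical helper name, and the list of module names is taken from the imports in the question.

```python
import importlib.util

def missing_modules(names):
    """Return those top-level import names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    needed = ["httpx", "nest_asyncio", "bs4", "pandas"]
    missing = missing_modules(needed)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Import names and pip package names can differ: the bs4 module, for instance, is installed with pip install beautifulsoup4.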