I'm learning web scraping and am trying to crawl data from the page linked below. Is there a way to also extract the link inside each td?
The website link: http://eecs.qmul.ac.uk/postgraduate/programmes/
Here's what I've done so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

table_list = []
rows = soup.find_all('tr')
# For every row in the table, find each cell element and add it to the list
for row in rows:
    row_td = row.find_all('td')
    row_cells = str(row_td)
    row_cleantext = BeautifulSoup(row_cells, "lxml").get_text()
    table_list.append(row_cleantext)
print(table_list)
CodePudding user response:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
main_data=soup.find_all("td")
Collect every td tag into main_data and iterate over it. For each td, find the a tag and call .get("href") to extract the link. If a td has no a tag, find returns None and calling .get on it raises AttributeError, so wrap the lookup in try-except:
for data in main_data:
    try:
        link = data.find("a").get("href")
        print(link)
    except AttributeError:
        # This td has no a tag, so skip it
        pass
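The same extraction can also be written without try-except by checking for a missing a tag explicitly. The following is only a minimal sketch under a few assumptions: it parses a small made-up HTML snippet (so it runs without network access) using Python's built-in html.parser instead of lxml, the example.com URLs are placeholders, and the helper name extract_links is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the real page, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr>
    <td>Artificial Intelligence</td>
    <td><a href="https://example.com/ai-part-time">I4U2</a></td>
    <td><a href="https://example.com/ai-full-time">I4U1</a></td>
  </tr>
</table>
"""


def extract_links(html):
    """Return (text, href) pairs for every td that contains a link (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for cell in soup.find_all("td"):
        # href=True skips anchors that have no href attribute.
        anchor = cell.find("a", href=True)
        if anchor is None:
            continue  # plain-text cell, nothing to extract
        links.append((anchor.get_text(strip=True), anchor["href"]))
    return links


if __name__ == "__main__":
    for text, href in extract_links(SAMPLE_HTML):
        print(text, href)
```

For the real page, the response from urlopen(url) can be passed to extract_links in place of SAMPLE_HTML.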
For understanding only, this version prints both the link text and the href:
main_data = soup.find_all("td")
for data in main_data:
    try:
        link = data.find("a")
        print(link.text)
        print(link.get("href"))
    except AttributeError:
        # This td has no a tag, so skip it
        pass
Output:
H60C
https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/
H60A
https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/
..
To build a table from the results, you can use the pandas module:
import pandas as pd

main_data = soup.find_all("td")
dict1 = {}
for data in main_data:
    try:
        link = data.find("a")
        # Map the link text to its href
        dict1[link.text] = link.get("href")
    except AttributeError:
        # This td has no a tag, so skip it
        pass

df = pd.DataFrame(dict1.items(), columns=["Text", "Link"])
Output:
Text Link
0 H60C https://www.qmul.ac.uk/postgraduate/taught/cou...
1 H60A https://www.qmul.ac.uk/postgraduate/taught/cou...
2 I4U2 https://www.qmul.ac.uk/postgraduate/taught/cou...
..
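One caveat with the dictionary approach: if two links share the same text, the later one overwrites the earlier one. A possible alternative, sketched below, walks the table row by row and keeps one record per link together with the programme name from the first cell. This is an illustration only: the HTML snippet, the "Example Programme" row, the code "X000", and the example.com URLs are made up, html.parser is used so no extra parser is needed, and rows_with_links is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the real page, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Programme</th><th>Part-time</th><th>Full-time</th></tr>
  <tr>
    <td>Artificial Intelligence</td>
    <td><a href="https://example.com/ai-part-time">I4U2</a></td>
    <td><a href="https://example.com/ai-full-time">I4U1</a></td>
  </tr>
  <tr>
    <td>Example Programme</td>
    <td></td>
    <td><a href="https://example.com/example-full-time">X000</a></td>
  </tr>
</table>
"""


def rows_with_links(html):
    """Return one dict per link, tagged with the first cell of its row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if not cells:
            continue  # header rows contain th cells, not td
        programme = cells[0].get_text(strip=True)
        for anchor in row.find_all("a", href=True):
            records.append(
                {
                    "Programme": programme,
                    "Text": anchor.get_text(strip=True),
                    "Link": anchor["href"],
                }
            )
    return records


if __name__ == "__main__":
    for record in rows_with_links(SAMPLE_HTML):
        print(record)
```

Passing the returned list to pd.DataFrame(records) gives a table with one row per link and the columns Programme, Text, and Link.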
To read the table directly from the website, use pandas.read_html:
import pandas as pd
data=pd.read_html("http://eecs.qmul.ac.uk/postgraduate/programmes/")
df=data[0]
df
Output:
                   Postgraduate degree programmes Part-time(2 year) Full-time(1 year)
0  Advanced Electronic and Electrical Engineering              H60C              H60A
1                         Artificial Intelligence              I4U2              I4U1
.....
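read_html on its own returns only the cell text. If the installed pandas is version 1.5 or newer, its extract_links parameter can return the href alongside the text. The sketch below assumes that version, assumes an HTML parser that read_html supports is installed (lxml, as used earlier in this answer), and runs on a made-up snippet with placeholder example.com URLs rather than the real page.

```python
from io import StringIO

import pandas as pd

# Made-up sample markup standing in for the real page, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Programme</th><th>Part-time</th><th>Full-time</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Artificial Intelligence</td>
      <td><a href="https://example.com/ai-part-time">I4U2</a></td>
      <td><a href="https://example.com/ai-full-time">I4U1</a></td>
    </tr>
  </tbody>
</table>
"""

# extract_links="body" makes each body cell a (text, href) tuple;
# the header cells are left as plain strings.
tables = pd.read_html(StringIO(SAMPLE_HTML), extract_links="body")
df = tables[0]

# Unpack one cell into its text and its link.
part_time_text, part_time_link = df.loc[0, "Part-time"]
print(part_time_text, part_time_link)
```

For the real page, the URL string can be passed to read_html in place of the StringIO object.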