Python BeautifulSoup - How to crawl links <a> inside values in <td>

Time:12-06

I'm learning web scraping and am trying to scrape data from the page linked below. Is there a way to also extract the link (the href of the <a> tag) inside each <td>?

The website link: http://eecs.qmul.ac.uk/postgraduate/programmes/

Here's what I did so far.

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

table_list = []
rows = soup.find_all('tr')

# For every row in the table, find each cell element and add it to the list
for row in rows:
    row_td = row.find_all('td')
    row_cells = str(row_td)
    row_cleantext = BeautifulSoup(row_cells, "lxml").get_text()
    table_list.append((row_cleantext))

print(table_list)

CodePudding user response:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
main_data=soup.find_all("td")

main_data holds every <td> tag on the page. Iterate over it, find the <a> tag inside each cell, and read its href with .get. When a cell has no <a> tag, find returns None and the attribute access raises AttributeError, so a try-except is used to skip those cells.

for data in main_data:
    try:
        link=data.find("a").get("href")
        print(link)
    except AttributeError:
        pass
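
As an alternative to the try-except loop above, here is a minimal sketch that selects only the anchors that actually sit inside a td and carry an href, and resolves relative links against the page URL. The sample markup, the example.org URLs, and the helper name extract_td_links are assumptions made for illustration; they are not taken from the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Programme</th><th>Part-time</th><th>Full-time</th></tr>
  <tr>
    <td>Example Programme</td>
    <td><a href="/courses/example-msc/">X100</a></td>
    <td><a href="https://example.org/courses/example-msc/">X101</a></td>
  </tr>
</table>
"""
BASE_URL = "https://example.org/postgraduate/programmes/"  # assumed page URL


def extract_td_links(html, base_url):
    """Return (link text, absolute URL) for every <a href> inside a <td>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # The CSS selector skips cells without a link, so no try-except is needed.
    for a in soup.select("td a[href]"):
        links.append((a.get_text(strip=True), urljoin(base_url, a["href"])))
    return links


links = extract_td_links(SAMPLE_HTML, BASE_URL)
for text, href in links:
    print(text, href)
```

With the real page, the same function could be called on the html fetched by urlopen, passing url as the base.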

For understanding only:

main_data=soup.find_all("td")
for data in main_data:
    try:
        link=data.find("a")
        print(link.text)
        print(link.get("href"))
    except AttributeError:
        pass

Output:

H60C
https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/
H60A
https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/

..

To create a table, you can use the pandas module:

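import pandas as pd

main_data = soup.find_all("td")
dict1 = {}
for data in main_data:
    try:
        link = data.find("a")
        # Map the link text (e.g. "H60C") to its URL
        dict1[link.text] = link.get("href")
    except AttributeError:
        # find() returned None: this cell has no <a> tag, so skip it
        pass
# One row per (text, link) pair
df = pd.DataFrame(dict1.items(), columns=["Text", "Link"])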

Output:

    Text    Link
0   H60C    https://www.qmul.ac.uk/postgraduate/taught/cou...
1   H60A    https://www.qmul.ac.uk/postgraduate/taught/cou...
2   I4U2    https://www.qmul.ac.uk/postgraduate/taught/cou...
..
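
One thing to keep in mind with the dictionary approach: the link text is the key, so two links with the same text would overwrite each other, and the programme name is lost. A possible variation, sketched below, walks the table row by row and keeps the first cell's text alongside each link. The sample markup and the function name rows_with_links are hypothetical, and the sketch assumes the programme name is in the first td of each row.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Programme</th><th>Part-time</th><th>Full-time</th></tr>
  <tr>
    <td>Example Programme</td>
    <td><a href="https://example.org/courses/example-msc/">X100</a></td>
    <td><a href="https://example.org/courses/example-msc/">X101</a></td>
  </tr>
  <tr><td>Programme Without Links</td><td>n/a</td><td>n/a</td></tr>
</table>
"""


def rows_with_links(html):
    """Return one record per link, tagged with the name from the row's first cell."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if not cells:
            continue  # header rows contain <th>, not <td>
        programme = cells[0].get_text(strip=True)
        for cell in cells[1:]:
            a = cell.find("a", href=True)
            if a is not None:
                records.append({
                    "Programme": programme,
                    "Code": a.get_text(strip=True),
                    "Link": a["href"],
                })
    return records


records = rows_with_links(SAMPLE_HTML)
for record in records:
    print(record)
```

A list of dictionaries like this can be passed straight to pd.DataFrame(records) to get one row per link.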

Getting the table directly from the website:

import pandas as pd
data=pd.read_html("http://eecs.qmul.ac.uk/postgraduate/programmes/")
df=data[0]
df

Output:

                   Postgraduate degree programmes  Part-time(2 year)  Full-time(1 year)
0  Advanced Electronic and Electrical Engineering               H60C               H60A
1                         Artificial Intelligence               I4U2               I4U1
.....
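
Note that read_html as called above returns only the cell text, not the links. In pandas 1.5 and later, read_html accepts an extract_links argument that keeps the href next to the text. The sketch below uses it on a small hypothetical table (the markup and the example.org URL are assumptions, not the real page) and needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Programme</th><th>Part-time</th></tr></thead>
  <tbody>
    <tr>
      <td>Example Programme</td>
      <td><a href="https://example.org/x100/">X100</a></td>
    </tr>
  </tbody>
</table>
"""

# extract_links="body" turns every body cell into a (text, href) tuple;
# href is None for cells that contain no <a> tag.
tables = pd.read_html(StringIO(SAMPLE_HTML), extract_links="body")
df = tables[0]

text, href = df.loc[0, "Part-time"]
print(text, href)
```

The tuples can then be split into separate text and link columns, for example with df["Part-time"].str[1] to take the href.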