How can I scrape data from nested tables in Python?

Time:12-29

I am trying to scrape radiofeeds so that I can compile the following list of all UK radio stations.

Station Name     Geographical Location    Stream URLs
Athavan Radio    Greater London           { 64: "http://38.96.148.140:6150/listen.pls?sid=1",
                                            128: "http://www.radiofeeds.net/playlists/athavan.m3u",
                                            320: "http://www.radiofeeds.net/playlists/zeno.pls?station=e4k5enu6g18uv"}
...              ...                      ...

The stream URLs are provided in a nested table, which clearly causes difficulties when searching for new rows based on the next tr tag, as the next new row might be in a child table.

Thus far, I have tried the following in BeautifulSoup.

from bs4 import BeautifulSoup
from requests import get

url = "http://www.radiofeeds.co.uk/mp3.asp"
page = get(url=url).text
lead = "Listen live online to"
foot = "Have <b>YOUR</b> internet"
start = page.find(lead)
stop = page.find(foot)

soup = BeautifulSoup(page[start:stop], "html.parser")

data = []
table = soup.find("table")
rows = table.find_all("tr")
for row in rows:
    station_name = row.find("a").text
    print(f"STATION_NAME: {station_name}")

    cols = len(row.find_all("td"))
    print(f"COL_COUNT: {cols}")

    print("=====")

The output for the first three stations is:

STATION_NAME: 121 Radio
COL_COUNT: 9
=====
STATION_NAME: 

COL_COUNT: 2
=====
STATION_NAME: 10-fi Radio
COL_COUNT: 9
=====
STATION_NAME: 

COL_COUNT: 2
=====
STATION_NAME: 45 Radio
COL_COUNT: 9
=====
STATION_NAME: 

COL_COUNT: 2
=====

Each loop is clearly jumping between the parent and child table, as shown by the varying COL_COUNT. How do I search the current row including its child tables?

I am happy to use a different library if BeautifulSoup is not the best one for this use.
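
The alternating COL_COUNT is consistent with how find_all works: it searches all descendants by default, so table.find_all("tr") also returns the rows of every child table. Passing recursive=False restricts the search to direct children. The sketch below shows this on a made-up miniature table; the markup and the station name are assumptions for illustration, not the real radiofeeds HTML. If the real page wraps its rows in tbody, the direct children of the table would be tbody rather than tr.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: one station row whose second cell
# holds a child table of stream links. Not the real radiofeeds markup.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="/station">Example FM</a></td>
    <td>
      <table>
        <tr>
          <td><a href="http://example.com/64.m3u">64</a></td>
          <td><a href="http://example.com/128.m3u">128</a></td>
        </tr>
      </table>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
outer = soup.find("table")

# Default search: descends into the child table as well.
all_rows = outer.find_all("tr")

# Direct children only: the station row, without the nested row.
own_rows = outer.find_all("tr", recursive=False)

print(len(all_rows), len(own_rows))
```

Here the default search counts the nested row as well, which is the same pattern as the 9/2 column counts above.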

CodePudding user response:

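Try this. Rather than walking the rows, it finds each station link by its onclick attribute and then takes the first link in the next td with width 68%, so it returns one stream URL per station: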

from bs4 import BeautifulSoup
import pandas as pd
import requests

response = requests.get('http://www.radiofeeds.co.uk/mp3.asp')
# The 'lxml' parser requires the lxml package to be installed.
soup = BeautifulSoup(response.text, 'lxml')
result = []
# Station-name links are identified by their onclick attribute.
for radio in soup.find_all('a', {'onclick': 'if(!confirm(txt)) return false;'}):
    result.append({
        'Station Name': radio.get_text(),
        # First link inside the next 68%-wide cell (the stream cell).
        'Stream URL': radio.find_next('td', {'width': '68%'}).find('a').get('href')
    })
df = pd.DataFrame(result)
print(df)

OUTPUT:

                    Station Name                                         Stream URL
0                      121 Radio  http://ukwesta.streaming.broadcast.radio:10650...
1                    10-fi Radio     http://edge.clrmedia.co.uk:10000/chiptunes.m3u
2                       45 Radio  http://stream1.themediasite.co.uk:8020/listen....
3                     80s Rhythm  http://www.radiofeeds.net/playlists/80srhythmh...
4                   Abacus Radio  http://stream3.hippynet.co.uk:8008/listen.pls?...
..                           ...                                                ...
605  YO1 Radio (North Yorkshire)        http://ec1.yesstreaming.net:1490/stream.m3u
606                YorkMix Radio      http://stream2.hippynet.co.uk:8020/mobile.m3u
607               Your Harrogate  http://streaming.broadcastradio.com:11040/your...
608                         Zest  http://vps.zestliverpool.com:8000/listen.pls?s...
609                     Zest 60s  http://vps.zestliverpool.com:8000/listen.pls?s...
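
The result above holds one stream URL per station. To get the bitrate-to-URL mapping the question asks for, one option is to iterate only the direct rows of the outer table and collect every link in the nested stream table. This is a minimal sketch under assumptions: the sample HTML, the three-column layout, the parse_stations helper and the numeric link labels are all hypothetical and would need adapting to the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical layout: name cell, location cell, then a cell holding a
# nested table whose links are labelled with the bitrate.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="/a">Station A</a></td>
    <td>Greater London</td>
    <td><table><tr>
      <td><a href="http://example.com/a64.pls">64</a></td>
      <td><a href="http://example.com/a128.m3u">128</a></td>
    </tr></table></td>
  </tr>
  <tr>
    <td><a href="/b">Station B</a></td>
    <td>Leeds</td>
    <td><table><tr>
      <td><a href="http://example.com/b320.m3u">320</a></td>
    </tr></table></td>
  </tr>
</table>
"""


def parse_stations(html):
    """Return one dict per outer row, with stream URLs keyed by bitrate."""
    soup = BeautifulSoup(html, "html.parser")
    outer = soup.find("table")
    stations = []
    # recursive=False keeps the loop on the outer table's own rows.
    for row in outer.find_all("tr", recursive=False):
        cells = row.find_all("td", recursive=False)
        if len(cells) < 3:
            continue  # skip rows that do not match the assumed layout
        streams = {}
        # Searching within the last cell stays inside this row's child table.
        for link in cells[2].find_all("a", href=True):
            label = link.get_text(strip=True)
            if label.isdigit():
                streams[int(label)] = link["href"]
        stations.append({
            "Station Name": cells[0].get_text(strip=True),
            "Geographical Location": cells[1].get_text(strip=True),
            "Stream URLs": streams,
        })
    return stations


stations = parse_stations(SAMPLE_HTML)
for station in stations:
    print(station)
```

Because each search starts from the current row's own cell, links from other stations' child tables cannot leak into the dictionary.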