I am trying to scrape radiofeeds.co.uk so that I can compile a list of all UK radio stations in the following form.
Station Name | Geographical Location | Stream URLs |
---|---|---|
Athavan Radio | Greater London | { 64: "http://38.96.148.140:6150/listen.pls?sid=1", 128 : "http://www.radiofeeds.net/playlists/athavan.m3u", 320 : "http://www.radiofeeds.net/playlists/zeno.pls?station=e4k5enu6g18uv"} |
... | ... | ... |
The stream URLs are provided in a nested table, which causes difficulties when searching for new rows based on the next tr tag, as the next row might be in a child table.
Thus far, I have tried the following in BeautifulSoup.
from bs4 import BeautifulSoup
from requests import get
url = "http://www.radiofeeds.co.uk/mp3.asp"
page = get(url=url).text
lead = "Listen live online to"
foot = "Have <b>YOUR</b> internet"
start = page.find(lead)
stop = page.find(foot)
soup = BeautifulSoup(page[start:stop], "html.parser")
data = []
table = soup.find("table")
rows = table.find_all("tr")
for row in rows:
    station_name = row.find("a").text
    print(f"STATION_NAME: {station_name}")
    cols = len(row.find_all("td"))
    print(f"COL_COUNT: {cols}")
    print("=====")
The output of this for the first three stations is:
STATION_NAME: 121 Radio
COL_COUNT: 9
=====
STATION_NAME:
COL_COUNT: 2
=====
STATION_NAME: 10-fi Radio
COL_COUNT: 9
=====
STATION_NAME:
COL_COUNT: 2
=====
STATION_NAME: 45 Radio
COL_COUNT: 9
=====
STATION_NAME:
COL_COUNT: 2
=====
Each iteration is clearly jumping between the parent and child table, as shown by the varying COL_COUNT. How do I search the current row including its child tables?
I am happy to use a different library if BeautifulSoup is not the best one for this use.
CodePudding user response:
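Try this. It selects each station link by its onclick attribute, then takes the first link in the next td with width 68%: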
from bs4 import BeautifulSoup
import pandas as pd
import requests
response = requests.get('http://www.radiofeeds.co.uk/mp3.asp')
soup = BeautifulSoup(response.text, 'lxml')
result = []
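# Station name links are the anchors carrying this onclick confirmation handler
for radio in soup.find_all('a', {'onclick': 'if(!confirm(txt)) return false;'}):
    result.append({
        'Station Name': radio.get_text(),
        # find_next is the current name for the deprecated findNext alias
        'Stream URL': radio.find_next('td', {'width': '68%'}).find('a').get('href')
    })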
df = pd.DataFrame(result)
print(df)
OUTPUT:
Station Name Stream URL
0 121 Radio http://ukwesta.streaming.broadcast.radio:10650...
1 10-fi Radio http://edge.clrmedia.co.uk:10000/chiptunes.m3u
2 45 Radio http://stream1.themediasite.co.uk:8020/listen....
3 80s Rhythm http://www.radiofeeds.net/playlists/80srhythmh...
4 Abacus Radio http://stream3.hippynet.co.uk:8008/listen.pls?...
.. ... ...
605 YO1 Radio (North Yorkshire) http://ec1.yesstreaming.net:1490/stream.m3u
606 YorkMix Radio http://stream2.hippynet.co.uk:8020/mobile.m3u
607 Your Harrogate http://streaming.broadcastradio.com:11040/your...
608 Zest http://vps.zestliverpool.com:8000/listen.pls?s...
609 Zest 60s http://vps.zestliverpool.com:8000/listen.pls?s...
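The code above returns only the first stream URL per station. To stay on the parent rows and still collect every link from a row's child table, one option is to pass recursive=False to find_all, which restricts the search to direct children. The sketch below runs on a small made-up HTML sample, not on the real page: SAMPLE_HTML, parse_stations, the three-column layout, and the assumption that each link's text is its bitrate are all hypothetical, so the cell indices and the bitrate parsing would need adapting to the actual markup. It also assumes the rows are direct children of the table; if the page or parser wraps them in a tbody, the search would have to start from that element instead.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: each outer row holds a station name,
# a location, and a nested table of stream links labelled by bitrate.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="/station/1">Example FM</a></td>
    <td>Greater London</td>
    <td>
      <table>
        <tr>
          <td><a href="http://example.com/64.pls">64</a></td>
          <td><a href="http://example.com/128.m3u">128</a></td>
        </tr>
      </table>
    </td>
  </tr>
  <tr>
    <td><a href="/station/2">Sample Radio</a></td>
    <td>Leeds</td>
    <td>
      <table>
        <tr>
          <td><a href="http://example.org/320.pls">320</a></td>
        </tr>
      </table>
    </td>
  </tr>
</table>
"""


def parse_stations(html):
    """Return one dict per outer row, with all nested stream links collected."""
    soup = BeautifulSoup(html, "html.parser")
    outer_table = soup.find("table")
    stations = []
    # recursive=False looks only at direct children, so rows that belong
    # to the nested tables are not visited by this loop.
    for row in outer_table.find_all("tr", recursive=False):
        cells = row.find_all("td", recursive=False)
        # The default recursive search inside the third cell reaches
        # every link in that row's child table.
        streams = {
            int(link.get_text(strip=True)): link["href"]
            for link in cells[2].find_all("a")
        }
        stations.append({
            "Station Name": cells[0].get_text(strip=True),
            "Geographical Location": cells[1].get_text(strip=True),
            "Stream URLs": streams,
        })
    return stations


for station in parse_stations(SAMPLE_HTML):
    print(station)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as result above.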