Script is not returning proper output when trying to retrieve data from a newsletter

I am trying to write a script that retrieves the album title and band name of each record from a music store newsletter. The album title and band name sit in h3 and h4 elements. When I run the script, I get blank output in the CSV file.

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)

# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all 'a' elements with the class 'row'
albums = soup.find_all('a', attrs={'class': 'row'})

# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
  album_title_element = album.find('td', attrs={'td_class': 'h3 class'})
  band_name_element = album.find('td', attrs={'td_class': 'h4 class'})
  album_title.append(album_title_element.text)
  band_name.append(band_name_element.text)

# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')

I think the error is in the attrs part, but I'm not sure how to fix it properly. Thanks in advance!

CodePudding user response：

Looking at your code, I agree that the error lies in the attrs part. The page you are scraping does not contain any 'a' elements with the 'row' class, so find_all returns an empty list, the loop body never runs, and the DataFrame written to the CSV has no rows. There are plenty of 'div' elements with the 'row' class; maybe you meant to look for those?

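You had the right idea by looking for 'td' elements and extracting their 'h3' and 'h4' elements, but since albums is an empty list, there are no elements to search in. Note also that the keys of attrs are attribute names, so {'td_class': 'h3 class'} looks for an attribute called td_class, which no element has. Even with a non-empty albums list, those find calls would return None.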
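To make the two failure modes concrete, here is a minimal sketch on a made-up HTML fragment. The markup is an assumption loosely modelled on the description in this thread, not a copy of the real newsletter, so the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real newsletter may differ.
html = """
<div class="row">
  <table><tr><td class="block__cell">
    <h3>Some Album</h3>
    <h4>Some Band</h4>
  </td></tr></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# No <a class="row"> exists, so the result is an empty list
# and a for loop over it would never execute its body.
links = soup.find_all("a", attrs={"class": "row"})

# The same class on <div> elements does match.
divs = soup.find_all("div", attrs={"class": "row"})

# attrs keys are attribute names: this searches for td_class="h3 class",
# an attribute no element carries, so find() returns None.
cell = soup.find("td", attrs={"td_class": "h3 class"})

# Searching by tag name inside the cell finds the heading.
title = soup.find("td", attrs={"class": "block__cell"}).find("h3").text

print(len(links), len(divs), cell, title)
```

Printing the length of a find_all result like this is a quick way to check a selector before building the rest of the script on top of it.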

I changed your code slightly to look for 'td' elements directly and extract their 'h3' and 'h4' elements. With these small changes your code found 29 albums.

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)

# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all 'td' elements with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'})

# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
  album_title_element = album.find('h3')
  band_name_element = album.find('h4')
  album_title.append(album_title_element.text)
  band_name.append(band_name_element.text)

# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv', index=False)

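I also took the liberty of adding index=False to the last line of your code. Without it, pandas writes the DataFrame index as an unnamed first column, so the header line starts with a comma and every data line starts with its row number.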
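The effect of index=False is easy to see on a tiny DataFrame. This is only a sketch with made-up rows; calling to_csv without a path returns the CSV text as a string, which makes the two variants easy to compare.

```python
import pandas as pd

# Made-up rows, only for illustration.
df = pd.DataFrame({
    "album_title": ["Album A", "Album B"],
    "band_name": ["Band X", "Band Y"],
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Header line when the index is written: it begins with an empty column name.
print(with_index.splitlines()[0])

# Header line with index=False: only the two real columns.
print(without_index.splitlines()[0])
```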

Hope this helps.

CodePudding user response：

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)

# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all 'td' elements with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'})

# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
  album_title_element = album.find('h3', attrs={'class': 'header'})
  band_name_element = album.find('h4', attrs={'class': 'header'})
  album_title.append(album_title_element.text)
  band_name.append(band_name_element.text)

# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')

Thanks to the anonymous hero for helping out!