I'm developing a web scraper to collect some information from AllMusic. However, I'm having difficulty returning the correct information when a tag contains more than one option (e.g. several href links).
Question: I need to return the first music genre for each artist. When an artist has only one genre, my code works. However, when an artist has more than one genre, I'm not able to select just the first one. Here is the code I wrote:
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
artists =['Alexander 23', 'Alex & Sierra', 'Tion Wayne', 'Tom Cochrane','The Waked']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
performer = []
links = []
genre = []
for artist in artists:
    # Search AllMusic for the artist and keep the link of the first result
    url = urllib.request.urlopen("https://www.allmusic.com/search/artist/" + urllib.parse.quote(artist))
    soup = BeautifulSoup(requests.get(url.geturl(), headers=headers).content, "html.parser")
    div = soup.select("div.name")[0]
    link = div.find_all('a')[0]['href']
    links.append(link)
    for l in links:
        # Open each collected artist page and read the first genre link
        soup = BeautifulSoup(requests.get(l, headers=headers).content, "html.parser")
        divGenre = soup.select("div.genre")[0]
        genres = divGenre.find('a')
        performer.append(artist)
        genre.append(genres.text)
df = pd.DataFrame(zip(performer, genre, links), columns=["artist", "genre", "link"])
df
CodePudding user response:
If I understand your question correctly, the main issue is that you iterate over links inside your outer for-loop, and that causes the repetition.
Consider changing your strategy: collect all the information for an artist in a single iteration and store it in a more structured way, for example as a list of dicts.
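To see why the nested loop repeats rows, here is a minimal offline sketch with no network calls. The artist names "A", "B", "C" and the fake_link helper are hypothetical stand-ins for the search and request steps, not part of the original code.

```python
# Offline sketch: placeholder strings stand in for the scraped links.
artists = ["A", "B", "C"]


def fake_link(artist):
    # Hypothetical stand-in for the AllMusic search request.
    return f"link-{artist}"


# Nested version: the inner loop walks over ALL links collected so far,
# so the current artist is appended once per collected link.
links, performer = [], []
for artist in artists:
    links.append(fake_link(artist))
    for l in links:
        performer.append(artist)

# zip() stops at the shortest list, which hides the extra entries
# but leaves the rows misaligned.
nested_rows = list(zip(performer, links))

# Single-pass version: exactly one record per artist.
data = [{"artist": a, "link": fake_link(a)} for a in artists]

print(performer)    # ['A', 'B', 'B', 'C', 'C', 'C']
print(nested_rows)  # [('A', 'link-A'), ('B', 'link-B'), ('B', 'link-C')]
print(len(data))    # 3
```

Note how the third row pairs artist B with the link for C: the lists no longer line up once the inner loop appends more than one entry per artist.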
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
artists =['Alexander 23', 'Alex & Sierra', 'Tion Wayne', 'Tom Cochrane','The Waked']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
data = []
for artist in artists:
    # Search for the artist and take the first result's link
    url = urllib.request.urlopen("https://www.allmusic.com/search/artist/" + urllib.parse.quote(artist))
    soup = BeautifulSoup(requests.get(url.geturl(), headers=headers).content, "html.parser")
    link = soup.select_one("div.name a").get('href')
    # Open the artist page; select_one returns only the first matching genre link
    soup = BeautifulSoup(requests.get(link, headers=headers).content, "html.parser")
    data.append({
        'artist': artist,
        'genre': soup.select_one("div.genre a").text,
        'link': link
    })
print(pd.DataFrame(data).to_markdown(index=False))
Output
| artist | genre | link |
|---|---|---|
| Alexander 23 | Pop/Rock | https://www.allmusic.com/artist/alexander-23-mn0003823464 |
| Alex & Sierra | Folk | https://www.allmusic.com/artist/alex-sierra-mn0003280540 |
| Tion Wayne | Rap | https://www.allmusic.com/artist/tion-wayne-mn0003666177 |
| Tom Cochrane | Pop/Rock | https://www.allmusic.com/artist/tom-cochrane-mn0000931015 |
| The Waked | Electronic | https://www.allmusic.com/artist/the-waked-mn0004025091 |
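On the original question of taking only the first genre: select_one returns the first element that matches the CSS selector, or None when nothing matches. The sketch below shows this offline. The HTML snippet is made up for illustration and only assumes the page has a div.genre block with several links; first_genre is a hypothetical helper that also guards against a missing block.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real AllMusic page may differ.
html = """
<div class="genre">
  <h4>Genre</h4>
  <div><a href="/genre/pop-rock">Pop/Rock</a>, <a href="/genre/folk">Folk</a></div>
</div>
"""


def first_genre(soup):
    # select_one returns the first matching <a>, or None if there is none.
    tag = soup.select_one("div.genre a")
    return tag.get_text(strip=True) if tag else None


soup = BeautifulSoup(html, "html.parser")
print(first_genre(soup))                                             # Pop/Rock
print(first_genre(BeautifulSoup("<p>no genre</p>", "html.parser")))  # None
```

Returning None instead of raising keeps one missing genre block from stopping the whole loop; the resulting DataFrame simply gets an empty value for that artist.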