How to select the first element in multi-valued HTML tags?

I'm building a web scraper to collect some information from AllMusic. However, I'm having trouble returning the correct value when a tag contains more than one option (e.g. several href links).

Question: I need to return the first music genre for each artist. When an artist has only one genre, my code works. However, when an artist has more than one genre, I'm not able to select just the first one. Here is my code:

import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request

artists =['Alexander 23', 'Alex & Sierra', 'Tion Wayne', 'Tom Cochrane','The Waked']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}


performer = []
links = []
genre = []

for artist in artists:
  url= urllib.request.urlopen("https://www.allmusic.com/search/artist/" + urllib.parse.quote(artist))
  soup = BeautifulSoup(requests.get(url.geturl(), headers=headers).content, "html.parser")
  div = soup.select("div.name")[0]
  link = div.find_all('a')[0]['href']
  links.append(link)
  for l in links:
    soup = BeautifulSoup(requests.get(l, headers=headers).content, "html.parser")
    divGenre= soup.select("div.genre")[0] 
    genres = divGenre.find('a')
    performer.append(artist)
    genre.append(genres.text)

df = pd.DataFrame(zip(performer, genre, links), columns=["artist", "genre", "link"])
df

CodePudding user response:

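If I understand your question correctly, the main issue is that you iterate over the links list inside your for loop, and that causes the repetition. The list keeps growing, so every earlier link is fetched again for each new artist, and performer and genre end up with more entries than links.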
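
The effect of the nested loop can be reproduced without any network access. The sketch below keeps the same loop shape as the question's code, but the artist names, the link strings, and the fake_genres lookup are made-up stand-ins for the scraped data:

```python
# Hypothetical stand-ins for the scraped data; no requests are made.
artists = ["A", "B", "C"]
fake_genres = {"link-A": "Pop/Rock", "link-B": "Folk", "link-C": "Rap"}

performer, links, genre = [], [], []

for artist in artists:
    links.append(f"link-{artist}")
    # Same shape as the question's code: walk ALL links collected so far,
    # not just the one that belongs to the current artist.
    for l in links:
        performer.append(artist)
        genre.append(fake_genres[l])

# links has one entry per artist, but the other two lists grew by 1 + 2 + 3.
print(len(links), len(performer), len(genre))

# zip() stops at the shortest list, so the rows are misaligned.
rows = list(zip(performer, genre, links))
print(rows)
```

Here artist "B" is paired with the genre of "A" and later with the link of "C", which is the kind of mismatch that shows up in the final DataFrame.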

Consider changing your strategy: collect all the information for an artist in a single iteration and store it in a more structured way, such as a list of dictionaries.

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
artists =['Alexander 23', 'Alex & Sierra', 'Tion Wayne', 'Tom Cochrane','The Waked']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

data = []

for artist in artists:
    # Search for the artist; quote() escapes spaces and '&' in the name
    url= urllib.request.urlopen("https://www.allmusic.com/search/artist/" + urllib.parse.quote(artist))
    soup = BeautifulSoup(requests.get(url.geturl(), headers=headers).content, "html.parser")
    # select_one() returns only the first match: the top search result
    link = soup.select_one("div.name a").get('href')
    # Load the artist page and keep only the first genre link
    soup = BeautifulSoup(requests.get(link, headers=headers).content, "html.parser")
    data.append({
        'artist':artist,
        'genre':soup.select_one("div.genre a").text,
        'link':link
    })

print(pd.DataFrame(data).to_markdown(index=False))

Output

artist	genre	link
Alexander 23	Pop/Rock	https://www.allmusic.com/artist/alexander-23-mn0003823464
Alex & Sierra	Folk	https://www.allmusic.com/artist/alex-sierra-mn0003280540
Tion Wayne	Rap	https://www.allmusic.com/artist/tion-wayne-mn0003666177
Tom Cochrane	Pop/Rock	https://www.allmusic.com/artist/tom-cochrane-mn0000931015
The Waked	Electronic	https://www.allmusic.com/artist/the-waked-mn0004025091
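
As for the title question itself, select_one() returns only the first element that matches the selector, while select() returns all of them. A minimal sketch on an inline HTML snippet follows; the markup is an assumption made for illustration and the real AllMusic page structure may differ. It also guards against select_one() returning None when nothing matches:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for an artist with two genres; not copied from AllMusic.
html = """
<div class="genre">
  <h4>Genre</h4>
  <div><a href="/genre/pop-rock">Pop/Rock</a>, <a href="/genre/electronic">Electronic</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every matching <a>, in document order.
all_genres = [a.get_text(strip=True) for a in soup.select("div.genre a")]

# select_one() returns only the first match, or None if there is none.
first = soup.select_one("div.genre a")
first_genre = first.get_text(strip=True) if first else None

# A selector that matches nothing yields None, so check before using .text.
missing = soup.select_one("div.styles a")

print(all_genres, first_genre, missing)
```

Checking for None before reading .text avoids an AttributeError for artists whose page has no genre block.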