How to select first element in multi-valued html tags?

Time:04-05

I'm developing a web scraper to collect some information from AllMusic. However, I am having difficulty returning the correct information when a tag contains more than one option (e.g. several href links).

Question: I need to return the first music genre for each artist. When an artist has a single genre, my code works. However, when an artist has more than one music genre, I'm not able to select just the first one. Here is my code:

import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
artists =['Alexander 23', 'Alex & Sierra', 'Tion Wayne', 'Tom Cochrane','The Waked']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}


performer = []
links = []
genre = []

for artist in artists:
  url= urllib.request.urlopen("https://www.allmusic.com/search/artist/" + urllib.parse.quote(artist))
  soup = BeautifulSoup(requests.get(url.geturl(), headers=headers).content, "html.parser")
  div = soup.select("div.name")[0]
  link = div.find_all('a')[0]['href']
  links.append(link)
  for l in links:
    soup = BeautifulSoup(requests.get(l, headers=headers).content, "html.parser")
    divGenre= soup.select("div.genre")[0] 
    genres = divGenre.find('a')
    performer.append(artist)
    genre.append(genres.text)

df = pd.DataFrame(zip(performer, genre, links), columns=["artist", "genre", "link"])
df

Answer:

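If I understand your question correctly, the main issue is that you iterate over links inside your for loop, and that causes the repetition. The links list keeps growing, so each new artist re-requests every link collected so far. As a result, performer and genre end up longer than links, and zip pairs artists with the wrong genres and links.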
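To make the repetition visible without hitting the website, here is a minimal sketch of the same loop shape run on stand-in data. The artist names, the link strings and the fake_genre lookup are all hypothetical placeholders; only the nesting mirrors the question's code.

```python
# Stand-in data: no network access, each "link" maps to a made-up genre label.
artists = ["A", "B", "C"]
fake_genre = {"link-A": "genre-A", "link-B": "genre-B", "link-C": "genre-C"}

performer, links, genre = [], [], []

for artist in artists:
    links.append(f"link-{artist}")
    # Same shape as the question's inner loop: it walks every link
    # collected so far, not just the one for the current artist.
    for l in links:
        performer.append(artist)
        genre.append(fake_genre[l])

# zip() stops at the shortest list, so the rows come out misaligned.
rows = list(zip(performer, genre, links))
print(len(performer), len(genre), len(links))  # 6 6 3
print(rows)
# [('A', 'genre-A', 'link-A'), ('B', 'genre-A', 'link-B'), ('B', 'genre-B', 'link-C')]
```

Artist B is paired with A's genre and with C's link, which is the kind of wrong result the question describes.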

Consider changing your strategy: get all the information for an artist in a single iteration and store it in a more structured way, for example as a list of dictionaries.

Example

import requests
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request
artists =['Alexander 23', 'Alex & Sierra', 'Tion Wayne', 'Tom Cochrane','The Waked']
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

data = []

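for artist in artists:
    # Resolve the search URL for this artist (quote() escapes spaces and '&')
    url = urllib.request.urlopen("https://www.allmusic.com/search/artist/" + urllib.parse.quote(artist))
    soup = BeautifulSoup(requests.get(url.geturl(), headers=headers).content, "html.parser")
    # select_one() returns only the first match: the top search result's link
    link = soup.select_one("div.name a").get('href')
    # Fetch the artist page once, inside the same iteration
    soup = BeautifulSoup(requests.get(link, headers=headers).content, "html.parser")
    data.append({
        'artist': artist,
        # First <a> inside div.genre, even when several genres are listed
        'genre': soup.select_one("div.genre a").text,
        'link': link
    })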

print(pd.DataFrame(data).to_markdown(index=False))
Output
| artist | genre | link |
|:--------------|:-----------|:----------------------------------------------------------|
| Alexander 23 | Pop/Rock | https://www.allmusic.com/artist/alexander-23-mn0003823464 |
| Alex & Sierra | Folk | https://www.allmusic.com/artist/alex-sierra-mn0003280540 |
| Tion Wayne | Rap | https://www.allmusic.com/artist/tion-wayne-mn0003666177 |
| Tom Cochrane | Pop/Rock | https://www.allmusic.com/artist/tom-cochrane-mn0000931015 |
| The Waked | Electronic | https://www.allmusic.com/artist/the-waked-mn0004025091 |
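One thing to watch: select_one returns None when nothing matches, so calling .text on the result raises an AttributeError for an artist page without a genre block. A minimal sketch of a guarded lookup follows, assuming the genres are links inside div.genre as in the selector used above. The helper name first_genre and the sample HTML are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup


def first_genre(html):
    """Return the text of the first genre link, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one("div.genre a")  # first match in document order
    return tag.get_text(strip=True) if tag else None


# Hypothetical markup with two genres; only the first one should come back.
sample = """
<div class="genre">
  <h4>Genre</h4>
  <div><a href="/genre/pop-rock">Pop/Rock</a>, <a href="/genre/folk">Folk</a></div>
</div>
"""

print(first_genre(sample))                                # Pop/Rock
print(first_genre("<div class='name'>no genre</div>"))    # None
```

In the scraping loop, the same guard could replace the direct .text access so that a missing genre is stored as None instead of stopping the run.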