I'm very new to web scraping and I'm scraping a Billboard page that lists the top 10 summer songs for each year from 1958 to 2021. My main goal is to end up with a dictionary with the year as the key and a list of that year's 10 songs as the value:
{"1958": ["NEL BLU DIPINTO DI BLU (VOLARÉ)", ...], "1959": ["LONELY BOY", ...]}
What I have so far is a list with one entry per year, where each entry is a multi-line string that looks like this:
1958Rank, Title, Artist
1, NEL BLU DIPINTO DI BLU (VOLARÉ), Domenico Modugno
2, POOR LITTLE FOOL, Ricky Nelson
3, PATRICIA, Perez Prado And His Orchestra
4, LITTLE STAR, The Elegants
5, MY TRUE LOVE, Jack Scott
6, JUST A DREAM, Jimmy Clanton And His Rockets
7, WHEN, Kalin Twins
8, BIRD DOG, The Everly Brothers
9, SPLISH SPLASH, Bobby Darin
10, REBEL-‘ROUSER, Duane Eddy His Twangy Guitar And The Rebels
Is there a way to extract just the song titles and add them to a separate list? I'm thinking it could be done either by checking whether a substring is fully capitalized, since the song titles are in all caps, or by taking the substring between the two commas, since each title sits between the comma after its rank and the comma before the artist.
The link for the Billboard website is attached here: https://www.billboard.com/pro/summer-songs-1985-present-top-10-tunes-each-summer-listen/
Answer:
There is no need for regex. To get your expected output, select only the <p> elements that contain a <strong> and iterate over their text nodes, taking the second field of each row with [s.split(', ')[1] for s in p.find_all(text=True)[2:]]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = 'https://www.billboard.com/pro/summer-songs-1985-present-top-10-tunes-each-summer-listen/'
doc = BeautifulSoup(requests.get(url).text, 'html.parser')

data = []
# each chart is a <p> whose <strong> holds the year
for p in doc.select('.pmc-paywall p:has(strong)'):
    # text nodes are [year, header row, song rows...], so skip the first two
    data.append({
        p.strong.text: [s.split(', ')[1] for s in p.find_all(text=True)[2:]]
    })
print(data)
Output:
[{'1958': ['NEL BLU DIPINTO DI BLU (VOLARÉ)', 'POOR LITTLE FOOL', 'PATRICIA', 'LITTLE STAR', 'MY TRUE LOVE', 'JUST A DREAM', 'WHEN', 'BIRD DOG', 'SPLISH SPLASH', 'REBEL-‘ROUSER']}, {'1959': ['LONELY BOY', 'THE BATTLE OF NEW ORLEANS', 'A BIG HUNK O’ LOVE', 'MY HEART IS AN OPEN BOOK', 'THE THREE BELLS', 'PERSONALITY', 'THERE GOES MY BABY', 'LAVENDER-BLUE', 'WATERLOO', 'TIGER']}, {'1960': ['I’M SORRY', 'IT’S NOW OR NEVER', 'EVERYBODY’S SOMEBODY’S FOOL', 'ALLEY-OOP', 'ITSY BITSY TEENIE WEENIE YELLOW POLKADOT BIKINI', 'ONLY THE LONELY (KNOW HOW I FEEL)', 'WALK — DON’T RUN', 'CATHY’S CLOWN', 'MULE SKINNER BLUES', 'BECAUSE THEY’RE YOUNG']},...]
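If you would rather work from the list of multi-line strings you already have, the "between two commas" idea from the question can be sketched without any scraping. This is only a minimal sketch: parse_block and build_chart are hypothetical helper names, and it assumes every block starts with a line like '1958Rank, Title, Artist' (a four-digit year fused to the header) and that titles themselves contain no ', '.

```python
def parse_block(block):
    """Split one year's text block into (year, [titles])."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    # Assumption: the first line is '<year>Rank, Title, Artist',
    # so the year is the leading four characters.
    year = lines[0][:4]
    titles = []
    for line in lines[1:]:
        # 'rank, TITLE, artist': maxsplit=2 keeps any commas in the artist intact.
        _rank, title, _artist = line.split(', ', 2)
        titles.append(title)
    return year, titles


def build_chart(blocks):
    """Merge all blocks into a single {year: [titles]} dictionary."""
    return dict(parse_block(block) for block in blocks)


sample = """1958Rank, Title, Artist
1, NEL BLU DIPINTO DI BLU (VOLARÉ), Domenico Modugno
2, POOR LITTLE FOOL, Ricky Nelson
3, PATRICIA, Perez Prado And His Orchestra"""

chart = build_chart([sample])
print(chart)
```

This produces one dictionary keyed by year, which is the shape asked for in the question, rather than a list of single-key dictionaries.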
To get more structured data, including rank and artist, that you can easily load into a DataFrame, one approach could be:
...
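data = []
for p in doc.select('.pmc-paywall p:has(strong)'):
    texts = p.find_all(text=True)
    # texts[1] is the header row 'Rank, Title, Artist'; strip so the keys carry no stray spaces
    keys = [k.strip() for k in texts[1].split(',')]
    for line in texts[2:]:
        row = dict(zip(keys, line.strip().split(', ')))
        row['year'] = p.strong.text
        data.append(row)
pd.DataFrame(data)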
Rank | Title | Artist | year |
---|---|---|---|
1 | NEL BLU DIPINTO DI BLU (VOLARÉ) | Domenico Modugno | 1958 |
2 | POOR LITTLE FOOL | Ricky Nelson | 1958 |
3 | PATRICIA | Perez Prado And His Orchestra | 1958 |
4 | LITTLE STAR | The Elegants | 1958 |
5 | MY TRUE LOVE | Jack Scott | 1958 |
....
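If the end goal is still the dictionary from the question, records like these can be grouped by year in plain Python. The sketch below uses a small hand-written sample in place of the scraped data and assumes the keys are exactly 'Title' and 'year', as in the table above; titles_by_year is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical sample standing in for the scraped records.
data = [
    {'Rank': '1', 'Title': 'NEL BLU DIPINTO DI BLU (VOLARÉ)', 'Artist': 'Domenico Modugno', 'year': '1958'},
    {'Rank': '2', 'Title': 'POOR LITTLE FOOL', 'Artist': 'Ricky Nelson', 'year': '1958'},
    {'Rank': '1', 'Title': 'LONELY BOY', 'Artist': 'Paul Anka', 'year': '1959'},
]

# Collect titles per year, preserving the rank order of the records.
grouped = defaultdict(list)
for row in data:
    grouped[row['year']].append(row['Title'])

titles_by_year = dict(grouped)
print(titles_by_year)
```

Keeping the full records in a list and deriving the dictionary at the end means rank and artist stay available if you need them later.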