I have a list of URLs, and I want a script that collects the lyrics from each Genius page and stores each one in its own .txt file.
I've already written the script, but for some reason the content it returns isn't complete.
Here's the code:
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
from time import time

urls = ['https://genius.com/The-Stooges-1969-lyrics','https://genius.com/The-Stooges-1970-lyrics',
        'https://genius.com/The-Rolling-Stones-19th-Nervous-Breakdown-lyrics','https://genius.com/Lil-Wayne-3-Peat-lyrics',
        'https://genius.com/RunDMC-30-Days-lyrics','https://genius.com/Bob-marley-and-the-wailers-four-hundred-years-lyrics',
        'https://genius.com/The-Clash-48-Hours-lyrics']

start = time()
for u in urls:
    soup = BeautifulSoup(requests.get(u).content, 'lxml')
    for tag in soup.select('div[class^="Lyrics__Container"], .song_body-lyrics p'):
        lyrics = tag.get_text(strip=True, separator='\n')
        if lyrics:
            with open("PATH\\" + str(urls.index(u)) + ".txt", 'w') as f:
                f.write(lyrics)
print(f'Time taken: {time() - start}')
Take, for example, the lyrics of the song at this URL: https://genius.com/Rundmc-30-days-lyrics.
This is what the script returns for it:
"[DMC] If you need a vacation, we can fly the world And you'll know I'll never look at another girl I'm a single-minded man, and my mind is set You're the lady of the '80s that I'm gonna get [Both] And if you find you don't like my ways Well, you can send me back in 30 days"
I can access the lyrics, but something seems to be missing to make the script robust, because it cuts off the content in certain situations.
Does anyone have any idea what I might be doing wrong?
CodePudding user response:
I can't say for certain why it's doing that; it might just be that the site renders differently at times. I made a few adjustments and so far haven't seen the issue. It possibly comes from how the text is parsed and then written to file: if the file is opened with 'w' inside the inner loop, each matching container overwrites what the previous one wrote. So I adjusted some of the indenting in the for loop so that the strings are concatenated first and the file is written once per URL:
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
from time import time

urls = ['https://genius.com/The-Stooges-1969-lyrics','https://genius.com/The-Stooges-1970-lyrics',
        'https://genius.com/The-Rolling-Stones-19th-Nervous-Breakdown-lyrics','https://genius.com/Lil-Wayne-3-Peat-lyrics',
        'https://genius.com/RunDMC-30-Days-lyrics','https://genius.com/Bob-marley-and-the-wailers-four-hundred-years-lyrics',
        'https://genius.com/The-Clash-48-Hours-lyrics']

start = time()
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Mobile Safari/537.36'}
for u in urls:
    response = requests.get(u, headers=headers)
    #print(response)
    soup = BeautifulSoup(response.text, 'lxml')
    # Collect the text of every lyrics container before writing anything
    lyrics = ''
    for tag in soup.find_all("div", {"class":re.compile(r'^Lyrics__Container')}):
        lyrics += tag.get_text(strip=True, separator='\n') + '\n'
    # Write once per URL, after the inner loop has finished
    if lyrics:
        with open("D:/test/lyrics/" + str(urls.index(u)) + ".txt", 'w') as f:
            f.write(lyrics)
        #print(lyrics)
print(f'Time taken: {time() - start}')
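To see the effect of that change without hitting the site, here is a minimal offline sketch. It assumes a page's lyrics arrive as several separate chunks (one per container); the `chunks` list and the two helper functions are hypothetical and only stand in for the scraped text. It compares opening the file with 'w' for every chunk against joining the chunks and writing once:

```python
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the text of several lyrics containers on one page.
chunks = ["[Verse 1]\nfirst part", "[Chorus]\nsecond part", "[Verse 2]\nlast part"]


def write_per_chunk(path, chunks):
    """Open the file with 'w' for every chunk: each write replaces the last."""
    for chunk in chunks:
        with open(path, "w", encoding="utf-8") as f:
            f.write(chunk)


def write_once(path, chunks):
    """Join all non-empty chunks first, then write the file a single time."""
    text = "\n".join(chunk for chunk in chunks if chunk)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    per_chunk_path = Path(tmp) / "per_chunk.txt"
    once_path = Path(tmp) / "once.txt"
    write_per_chunk(per_chunk_path, chunks)
    write_once(once_path, chunks)
    per_chunk_text = per_chunk_path.read_text(encoding="utf-8")
    once_text = once_path.read_text(encoding="utf-8")

print(per_chunk_text)  # only the last chunk is left in the file
print("---")
print(once_text)       # all three chunks, in order
```

If the truncated output in the question is always the last block of the song, this overwrite is a likely cause. Opening the file in append mode ('a') would also keep every chunk, but it would keep stale text from earlier runs too, so joining first and writing once is usually the simpler choice.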