I'm trying to parse a lyrics site and need to collect a song's lyrics, but I have issues with my output.
I need the lyrics displayed as in the attached screenshot.
I've figured out how to split the text at uppercase letters, but one problem remains: the brackets are split incorrectly. Here's my code:
import re
import requests
from bs4 import BeautifulSoup

r = requests.get('https://genius.com/Taylor-swift-lavender-haze-lyrics')
#print(r.status_code)
if r.status_code != 200:
    print('Error')

soup = BeautifulSoup(r.content, 'lxml')
titles = soup.find_all('title')
titles = titles[0].text
titlist = titles.split('Lyrics | ')
titlist.pop(1)
titlist = titlist[0].replace("\xa0", " ")
print(titlist)

divs = soup.find_all('div', {'class' : 'Lyrics__Container-sc-1ynbvzw-6 YYrds'})
#print(divs[0].text)
lyrics = (divs[0].text)
res = re.findall(r'[A-Z][^A-Z]*', lyrics)
res_l = []
for el in res:
    res_l.append(el + '\n')
    print(el)
The output is shown in a screenshot. How do I fix it?
For those who asked, I have added the full code.
CodePudding user response:
Because brackets have a special meaning in regex, you need to escape them. In Python you should be able to use \[ to match a literal opening bracket.
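A minimal sketch of how the escaped bracket could be worked into the asker's pattern. The function name split_lyrics and the sample string are made up for illustration, and the sample assumes the page text arrives with no separators between lines, as in the question:

```python
import re

# Bracketed section headers such as "[Verse 1]" are tried first and kept whole.
# Otherwise a token starts at an uppercase letter and runs until the next
# uppercase letter or the next opening bracket.
TOKEN = re.compile(r"\[[^\]]*\]|[A-Z][^A-Z\[]*")


def split_lyrics(text):
    """Split run-together lyrics into bracketed headers and lines (hypothetical helper)."""
    return [part.strip() for part in TOKEN.findall(text)]


# Illustrative input: the text of the lyrics container with all line breaks lost.
sample = (
    "[Intro]Meet me at midnight"
    "[Verse 1]Staring at the ceiling with you"
    "Oh, you don't ever say too much"
)

for line in split_lyrics(sample):
    print(line)
```

This only fixes the bracket handling. Splitting at uppercase letters will still break a line at any capitalized word in the middle of it, for example at "Yeah" in "I been under scrutiny (Yeah, oh, yeah)".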
CodePudding user response:
You can .unwrap() unnecessary tags (<a>, <span>), replace <br> with newlines, and then get the text:
import requests
from bs4 import BeautifulSoup

url = "https://genius.com/Taylor-swift-lavender-haze-lyrics"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for t in soup.select("#lyrics-root [data-lyrics-container]"):
    # drop the inline wrappers but keep the text inside them
    for tag in t.select("span, a"):
        tag.unwrap()

    # turn every <br> into a real line break
    for br in t.select("br"):
        br.replace_with("\n")

    print(t.text)
Prints:
[Intro]
Meet me at midnight
[Verse 1]
Staring at the ceiling with you
Oh, you don't ever say too much
And you don't really read into
My melancholia
[Pre-Chorus]
I been under scrutiny (Yeah, oh, yeah)
You handle it beautifully (Yeah, oh, yeah)
All this shit is new to me (Yeah, oh, yeah)
...and so on.
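To try the same unwrap / replace_with steps without hitting the network, here is a small offline sketch. SAMPLE_HTML is hypothetical markup that only imitates the structure the selector above expects (it is not copied from the live page), and extract_lyrics is a made-up helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a lyrics container; the real page may differ.
SAMPLE_HTML = (
    '<div id="lyrics-root">'
    '<div data-lyrics-container="true">'
    '[Intro]<br/>'
    '<a href="#"><span>Meet me at midnight</span></a><br/>'
    '<br/>'
    '[Verse 1]<br/>'
    '<span>Staring at the ceiling with you</span>'
    '</div>'
    '</div>'
)


def extract_lyrics(html):
    """Return the lyrics text with <br> tags converted to newlines."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for container in soup.select("#lyrics-root [data-lyrics-container]"):
        # Remove the inline wrappers but keep their text in place.
        for tag in container.select("span, a"):
            tag.unwrap()
        # Each <br> becomes a newline character.
        for br in container.select("br"):
            br.replace_with("\n")
        parts.append(container.get_text())
    return "\n".join(parts)


lyrics = extract_lyrics(SAMPLE_HTML)
print(lyrics)
```

Because the line breaks come from the page's own <br> tags, this approach does not depend on capitalization, so lines with capitalized words in the middle stay intact.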