How to split at uppercase and brackets

I'm trying to parse a lyrics site and need to collect a song's lyrics, but I have issues with my output.

I need the lyrics displayed as below (screenshot).

I've figured out how to split the text at uppercase letters, but one thing remains: the brackets are split incorrectly. Here's my code:

import re
import requests
from bs4 import BeautifulSoup


r = requests.get('https://genius.com/Taylor-swift-lavender-haze-lyrics')
#print(r.status_code)
if r.status_code != 200:
    print('Error')
soup = BeautifulSoup(r.content, 'lxml')
titles = soup.find_all('title')
titles = titles[0].text
titlist = titles.split('Lyrics | ')
titlist.pop(1)
titlist = titlist[0].replace("\xa0", " ")
print(titlist)
divs = soup.find_all('div', {'class' : 'Lyrics__Container-sc-1ynbvzw-6 YYrds'})
#print(divs[0].text)
lyrics = (divs[0].text)
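# Each match starts at an uppercase letter and runs to the next one.
# "[" is not uppercase, so it never starts a chunk of its own.
res = re.findall(r'[A-Z][^A-Z]*', lyrics)
res_l = []
for el in res:
    res_l.append(el + '\n')
    print(el)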

The output is shown in the screenshot. How do I fix it?

For those who asked, I've added the full code.

CodePudding user response：

Brackets have a special meaning in regex, so you need to escape them. In Python you should be able to use \[ to get what you want.
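A minimal sketch of how escaped brackets could be used here, assuming the scraped text arrives as one run with no newlines (as `divs[0].text` appears to return). Instead of splitting at every uppercase letter, it inserts a line break only where a line seems to end (a lowercase letter, `)` or `]`) directly before a line start (an uppercase letter or `[`). The names `LINE_BREAK`, `split_lyrics` and the sample string are made up for illustration.

```python
import re

# Zero-width boundary: the previous character ends a line (lowercase letter,
# ")" or "]") and the next one starts a line (uppercase letter or "[").
# Inside a character class "]" and "[" are escaped as \] and \[.
LINE_BREAK = re.compile(r"(?<=[a-z)\]])(?=[A-Z\[])")


def split_lyrics(text):
    """Return lyric lines, keeping bracketed headers such as [Intro] intact."""
    return LINE_BREAK.sub("\n", text).split("\n")


# Hypothetical input shaped like the concatenated text from the page.
sample = (
    "[Intro]Meet me at midnight[Verse 1]Staring at the ceiling with you"
    "Oh, you don't ever say too much"
)
for line in split_lyrics(sample):
    print(line)
```

This is only a heuristic: it would wrongly split a word with an inner capital (for example "McDonald"), and it misses lines that end in a digit, a comma or a question mark.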

CodePudding user response：

You can .unwrap() unnecessary tags (<a>, <span>), replace <br> with newlines, and then get the text:

import requests
from bs4 import BeautifulSoup

url = "https://genius.com/Taylor-swift-lavender-haze-lyrics"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for t in soup.select("#lyrics-root [data-lyrics-container]"):

    for tag in t.select("span, a"):
        tag.unwrap()

    for br in t.select("br"):
        br.replace_with("\n")

    print(t.text)

Prints:

[Intro]
Meet me at midnight

[Verse 1]
Staring at the ceiling with you
Oh, you don't ever say too much
And you don't really read into
My melancholia

[Pre-Chorus]
I been under scrutiny (Yeah, oh, yeah)
You handle it beautifully (Yeah, oh, yeah)
All this shit is new to me (Yeah, oh, yeah)

...and so on.
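To try the same idea without a network request or BeautifulSoup, here is a rough sketch using only the standard library's `html.parser`. It is not the code from the answer above: it swaps `unwrap()` and `replace_with()` for a small parser that drops every tag and emits a newline for each `<br>`. The class name `LyricsText`, the helper `lyrics_text` and the HTML fragment are made up for illustration; the real page markup may differ.

```python
from html.parser import HTMLParser


class LyricsText(HTMLParser):
    """Collect text, turning each <br> into a newline and dropping other tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Inline wrappers such as <a> and <span> contribute nothing themselves.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def lyrics_text(html):
    """Return the text of an HTML fragment with <br> rendered as newlines."""
    parser = LyricsText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical fragment shaped like one lyrics container.
sample = (
    '<div data-lyrics-container="true">[Intro]<br>'
    '<a href="#"><span>Meet me at midnight</span></a><br><br>'
    "[Verse 1]<br>Staring at the ceiling with you</div>"
)
print(lyrics_text(sample))
```

Because the line breaks come from the markup itself, this needs no guessing about capital letters or brackets.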