Scraping returning clipped content (Python)


I have a set of URLs stored in a list and I want to write a script that collects the lyrics from the Genius site and stores them, each in its own .txt file.

I've already made this script, but for some reason the content returned isn't complete.

Here's the code:

import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
from time import time

urls = ['https://genius.com/The-Stooges-1969-lyrics','https://genius.com/The-Stooges-1970-lyrics',
        'https://genius.com/The-Rolling-Stones-19th-Nervous-Breakdown-lyrics','https://genius.com/Lil-Wayne-3-Peat-lyrics',
        'https://genius.com/RunDMC-30-Days-lyrics','https://genius.com/Bob-marley-and-the-wailers-four-hundred-years-lyrics',
        'https://genius.com/The-Clash-48-Hours-lyrics']

start = time()

for u in urls:
    soup = BeautifulSoup(requests.get(u).content, 'lxml')
    for tag in soup.select('div[class^="Lyrics__Container"], .song_body-lyrics p'):
        lyrics = tag.get_text(strip=True, separator='\n')
        if lyrics:
            with open("PATH\\" str(urls.index(u)) ".txt", 'w') as f: 
                f.write(lyrics)      

print(f'Time taken: {time() - start}')

See, for example, the lyrics of the song at the URL https://genius.com/Rundmc-30-days-lyrics.

Now look at the output I get:

"[DMC] If you need a vacation, we can fly the world And you'll know I'll never look at another girl I'm a single-minded man, and my mind is set You're the lady of the '80s that I'm gonna get [Both] And if you find you don't like my ways Well, you can send me back in 30 days"

I can access the lyrics, but something seems to be missing to make the script robust, because in certain situations it cuts off part of the content.

Does anyone have any idea what I might be doing wrong?

CodePudding user response:

I can't really see why it's doing that; it might just be that the site renders differently at times. I made a few adjustments and so far haven't seen the issue. It's possibly down to how the text is parsed and then written to file, so I added a user-agent header, moved the string concatenation so that all of the Lyrics__Container divs for a page are joined first, and only write the file once per URL:

import requests
import re
import pandas as pd
from bs4 import BeautifulSoup
from time import time

urls = ['https://genius.com/The-Stooges-1969-lyrics','https://genius.com/The-Stooges-1970-lyrics',
        'https://genius.com/The-Rolling-Stones-19th-Nervous-Breakdown-lyrics','https://genius.com/Lil-Wayne-3-Peat-lyrics',
        'https://genius.com/RunDMC-30-Days-lyrics','https://genius.com/Bob-marley-and-the-wailers-four-hundred-years-lyrics',
        'https://genius.com/The-Clash-48-Hours-lyrics']


start = time()
headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Mobile Safari/537.36'}
for u in urls:
    response = requests.get(u, headers=headers)
    #print(response)
    soup = BeautifulSoup(response.text, 'lxml')
    
    lyrics = ''
    for tag in soup.find_all("div", {"class":re.compile(r'^Lyrics__Container')}):
        lyrics += tag.get_text(strip=True, separator='\n') + '\n'
    if lyrics:
        with open("D:/test/lyrics/" str(urls.index(u)) ".txt", 'w') as f: 
            f.write(lyrics)  
        #print(lyrics)

print(f'Time taken: {time() - start}')
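
If it's any use, here is a small variation on the same idea, just a sketch rather than anything battle-tested: it names each file after the URL slug instead of the list index (the save_lyrics helper and its out_dir parameter are only illustrative), checks the HTTP status before parsing, and opens the file with an explicit UTF-8 encoding, which avoids a UnicodeEncodeError on Windows when the lyrics contain curly quotes or accented characters. The D:/test/lyrics/ path is the same placeholder as above.

import re
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Mobile Safari/537.36'}

def save_lyrics(url, out_dir='D:/test/lyrics/'):
    # use the last part of the URL (e.g. "Rundmc-30-days-lyrics") as the file name
    slug = url.rstrip('/').split('/')[-1]

    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail loudly on a 403/404 instead of writing an empty file
    soup = BeautifulSoup(response.text, 'lxml')

    # Genius splits the lyrics across several Lyrics__Container divs,
    # so collect all of them before writing anything
    lyrics = '\n'.join(
        tag.get_text(strip=True, separator='\n')
        for tag in soup.find_all('div', class_=re.compile(r'^Lyrics__Container'))
    )

    if lyrics:
        # utf-8 avoids Windows cp1252 errors on curly quotes and accents
        with open(out_dir + slug + '.txt', 'w', encoding='utf-8') as f:
            f.write(lyrics)

# usage, with the same urls list as above
for u in urls:
    save_lyrics(u)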