I was trying to learn Selenium, and for fun I decided to scrape a Spotify playlist (so I deliberately didn't use the Spotify API for this). But it isn't obtaining the full list, just the songs that are already loaded. I tried the solutions on the web with scrolling and waiting, but nothing seems to be working. I also tried zooming out, which helps, but it only finds about 20-30 more results. Also, when I scroll down manually and then scrape, it ignores the first few songs and starts scraping from the part that is currently loaded. This is my code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import pandas as pd
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
website= "https://open.spotify.com/playlist/6iwz7yurUKaILuykiyeztu"
path= "C:/Users/ashut/Downloads/Misc Docs/chromedriver_win32/chromedriver.exe"
service=Service(executable_path=path)
driver=webdriver.Chrome(service=service)
driver.get(website)
containers=driver.find_elements(by="xpath",value='//div[@data-testid="tracklist-row"]/div[@aria-colindex="2"]/div')
titles = []
artists = []
links = []
for container in containers:
    title = container.find_element(by="xpath", value='./a/div').text
    artist = container.find_element(by="xpath", value='./span/a').text
    link = container.find_element(by="xpath", value='./span/a').get_attribute("href")
    titles.append(title)
    artists.append(artist)
    links.append(link)

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)
mydict={'titles':titles,'artists':artists,'links':links}
artistslist= pd.DataFrame(mydict)
artistslist.to_csv('list_of_artist.csv')
CodePudding user response:
The data is dynamically loaded, and there may be multiple artists for one item. I wrote a sample using the VS Code extension Clicknium; you can find my sample on GitHub.
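Independent of the tool, the multiple-artists point matters for the original XPath: './span/a' only returns the first artist anchor in a row. A minimal Selenium sketch of collecting every artist in a row could look like the following, where the 'row' variable is assumed to be one element matching //div[@data-testid="tracklist-row"], and filtering anchors by an '/artist/' href is an assumption about Spotify's current markup:
from selenium.webdriver.common.by import By

# 'row' is assumed to be one tracklist-row element already located elsewhere
# Collect every artist anchor inside the row, not just the first one
artist_links = row.find_elements(By.XPATH, ".//a[contains(@href, '/artist/')]")  # assumption: artist links contain '/artist/'
artist_names = [a.text for a in artist_links]
artist_hrefs = [a.get_attribute("href") for a in artist_links]
print(", ".join(artist_names))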
CodePudding user response:
That page dynamically loads content based on the user's actions, in this case scrolling and reaching the bottom. So you need to scroll the page to the bottom (a few times) until all songs have loaded and are available on the page. You can adapt the following snippet to your code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time as t
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
song_list = []
url='https://open.spotify.com/playlist/6iwz7yurUKaILuykiyeztu'
browser.get(url)
try:
    WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print("accepted cookies")
except Exception as e:
    print('no cookie button')

bottom_sentinel = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//div[@data-testid='bottom-sentinel']")))

for x in range(5):
    songs = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@data-testid='tracklist-row']")))
    for song in songs:
        print(song.text)
        song_list.append(song.text)
    t.sleep(2)
    bottom_sentinel.location_once_scrolled_into_view  # accessing this property scrolls the sentinel element into view
    browser.implicitly_wait(15)

print(list(set(song_list)))
print('Total songs:', len(list(set(song_list))))
This will print out quite a few duplicate songs and, at the end, a list of the unique songs along with the count of unique songs:
[...]
Total songs: 105
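To get from that raw list back to the CSV the question was aiming for, one option is to deduplicate while preserving order and split each entry's text into fields. A minimal sketch, assuming song_list holds the song.text strings collected above and that the track number, title, and first artist appear on the first three lines of each entry (that line layout is an assumption about Spotify's current markup, not something guaranteed by the snippet above):
import pandas as pd

# Deduplicate while preserving the order in which songs were collected
unique_songs = list(dict.fromkeys(song_list))

rows = []
for entry in unique_songs:
    lines = entry.split("\n")
    rows.append({
        "number": lines[0] if len(lines) > 0 else "",
        "title": lines[1] if len(lines) > 1 else "",
        "artist": lines[2] if len(lines) > 2 else "",
    })

pd.DataFrame(rows).to_csv("list_of_artist.csv", index=False)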