How to speed up Spotipy API calls for millions of records?-CodePudding

I'm attempting to get the audio feature data for about 4.5 years worth of Spotify Top 200 Charts. It's for 68 countries global ranking, so about 20 million records in all. I'm querying a SQL Lite database with all of that data. This is prep for a data analysis project and I've currently limited my scope to just pulling the 3rd Friday of every month because the fastest time I could get pulling an entire day's worth of audio features for the charts is 15.8 minutes. That's 18.5 days of straight processing to get all 1701 days.

Does anyone see any way I could make this faster? I'm currently calling the spotipy.audio_features() function for each track id. The function is limited to 100 ids and I'm not so sure that would be much faster anyway.

Here's an example entry before processing:

column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams']

('You Were Right', 179, '2017-01-20', 'RÜFÜS DU SOL', 'https://open.spotify.com/track/77lqbary6vt1DSc1MBN6sx', 'Australia', 'top200', 'NEW_ENTRY', 14781)

And after processing:

column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams', 'track_id', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']

('You Were Right', 179, '2017-01-20', 'RÜFÜS DU SOL', 'https://open.spotify.com/track/77lqbary6vt1DSc1MBN6sx', 'Australia', 'top200', 'NEW_ENTRY', 14781, '77lqbary6vt1DSc1MBN6sx', 0.708, 0.793, 5, -5.426, 0, 0.0342, 0.0136, 0.00221, 0.118, 0.734, 122.006, 239418, 4)

Full Script:

import sqlite3
import os
import spotipy
import numpy as np
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials
from requests.exceptions import ReadTimeout
from datetime import datetime

"""Gets the third Friday of each month and checks that the date exists in the database."""
def date_range_checker(cursor, start_date, end_date):
    # Put in the range for that year. It's till 2021.
    date_range = pd.date_range(start_date, end_date ,freq='WOM-3FRI')

    cursor.execute("""SELECT DISTINCT Date(date) FROM charts""")
    sql_date_fetch = cursor.fetchall()
    sql_dates = [r[0] for r in sql_date_fetch]

    validated_dates = []

    for date in date_range:
        # print(str(date)[0:-9])
        if str(date)[0:-9] in sql_dates:
            validated_dates.append(str(date)[0:-9])    
            
    return validated_dates

"""Connects to the database. For each date in validated_dates, it queries all the records with that date. 
Then splits the track IDs from the Spotify link into a new list of tuples. Then for each tuple in that list, it calls the Spotify API with the track ID.
Finally it creates a numpy array for the entire list so the csv converter can be used."""
def main(): 
    now_start = datetime.now()
    start_time = now_start.strftime("%H:%M:%S")
    print(f'Main Function - start time: {start_time}')

    """"This script queries """
    print("working on it...")

    dbname = 'charts.db'

    if os.path.exists(dbname):
        db = sqlite3.connect(dbname, isolation_level=None)
        cursor = db.cursor()

        """"Selects 3rd friday of the month because it takes about 15.8 minutes per day. That's 14.2 hours total to get one friday a month for all 4.5 years.
        Or 18.6 full days of processing for every single day for all 1701 days.
         Fridays are a preferable release day in the industry. Cite this later."""

        # Date range list created and checked in this function
        validated_dates = date_range_checker(cursor, '2017-02-01', '2017-12-31') # change year here

        column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams', 'track_id', 'danceability', 
        'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 
        'duration_ms', 'time_signature']
        
        for date_chosen in validated_dates:
            cursor.execute("""SELECT * FROM charts WHERE Date("date") = ?""", (date_chosen,))
            db_result = cursor.fetchall()

            data_with_track_ids = []
            final_data = []

            # Splits ID from Spotify link.
            for entry in db_result:
                track_id = entry[4].split('/')[-1]
                entry  = (track_id,)
                data_with_track_ids.append(entry)

            print("I've got all the track IDs. Will start calls to Spotify API now.")

            # Calls to spotify with the new extracted track_id
            for entry in data_with_track_ids:
                track_id = entry[-1]

                try:
                    audio_features = spotify.audio_features(track_id)
                except ReadTimeout:
                    print('Spotify timed out... trying again...')
                    audio_features = spotify.audio_features(track_id)

                entry  = (audio_features[0]['danceability'], audio_features[0]['energy'], audio_features[0]['key'], 
                audio_features[0]['loudness'], audio_features[0]['mode'], audio_features[0]['speechiness'], audio_features[0]['acousticness'], 
                audio_features[0]['instrumentalness'], audio_features[0]['liveness'],
                audio_features[0]['valence'], audio_features[0]['tempo'], audio_features[0]['duration_ms'], audio_features[0]['time_signature'])
                
                final_data.append(entry)

            np_data = np.array(final_data)
            my_dataframe = pd.DataFrame(np_data, columns=column_names)
            my_dataframe.to_csv(f'spotify_csv_data/spotify_top_200 {date_chosen}.csv')

            now_end = datetime.now()
            end_time = now_end.strftime("%H:%M:%S")
            print(f'Main Function - Start time: {start_time}. End time: {end_time}.')
            print(f'The date {date_chosen} took {now_end - now_start} to run.')
            

    db.close() 



if __name__ == "__main__":
    now_start = datetime.now()
    start_time = now_start.strftime("%H:%M:%S")
    print(f'Script - start time: {start_time}')


    os.environ['SPOTIPY_CLIENT_ID'] = 'ENTER YOUR CLIENT_ID'
    os.environ['SPOTIPY_CLIENT_SECRET'] = 'ENTER YOUR CLIENT_SECRET'

    # Allows for retries. Seems to be enough that it doesn't crash.
    spotify = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials(), requests_timeout=10, retries=10) 
    """Leave above set."""
    
    main()

    now_end = datetime.now()
    end_time = now_end.strftime("%H:%M:%S")
    print(f'Script - Start time: {start_time}. End time: {end_time}.')
    print(f'This script took {now_end - now_start} to run.\n')

CodePudding user response：

Profile, profile, profile. But the bottleneck is likely soptify's api. Whilst you can parallelise to speed up fetching, they won't thank you much for it and you will likely find yourself rate limited if you do it too much. So profile and see what is taking the time, but be prepared to cut back on your dataset.

Ask yourself what you can do to speed up the algorithm:

can you just fetch the top N hits?
do you really need all that data?
is any data duplicated?

Even if data isn't duplicated, create a local cache, indexed by the track_id, and store every request in that. Rather than requesting from the spotify endpoint, look it up in the cache (store the data in another sqlite database, or another table in the same db). If nothing is returned, fetch, save the data to the cache, and then return it. That way:

if you are doing redundant lookups, it will be faster.
even if you aren't, you can re-run your code blazingly fast (at least as regards your current speed) if you change something and need to run the lot again.

So cache, profile, and look at your algorithm.

CodePudding user response：

A few ideas to improve performance:

Use parallel processing

Since you are using Python, the code running is single threaded.

Using Python's multiprocessing library, you could (for example) run 4 instances of the same code but with evenly divided start/end dates. This can make your data processing ~4x faster. You would just need to write the data in such a way that there is no overlap.

Note: If you are rate limited by the Spotify API (you most likely will be), you can use different API keys for each instance. (Make multiple accounts or borrow a friends API key).

Sql query optimizations

It's worth looking into your queries to see what is going wrong. I'm personally not familiar with SQL, just giving you ideas.

Profile your program to understand more.

See How can you profile a Python script?

python3

# Splits ID from Spotify link.
for entry in db_result:
    track_id = entry[4].split('/')[-1]
    entry  = (track_id,)
    data_with_track_ids.append(entry)

In this code, what type is entry? How big is db_result?

Another thing worth mentioning regarding your following code:

python3

# Calls to spotify with the new extracted track_id
for entry in data_with_track_ids:
    track_id = entry[-1]

    try:
        audio_features = spotify.audio_features(track_id)
    except ReadTimeout:
        print('Spotify timed out... trying again...')
        audio_features = spotify.audio_features(track_id)

    entry  = (audio_features[0]['danceability'], audio_features[0]['energy'], audio_features[0]['key'], 
    audio_features[0]['loudness'], audio_features[0]['mode'], audio_features[0]['speechiness'], audio_features[0]['acousticness'], 
    audio_features[0]['instrumentalness'], audio_features[0]['liveness'],
    audio_features[0]['valence'], audio_features[0]['tempo'], audio_features[0]['duration_ms'], audio_features[0]['time_signature'])
    
    final_data.append(entry)

In the try-except block, you are making a request for every entry in data_with_track_ids. How many elements in the data_with_track_ids data structure? Expect to be throttled and timed out by Spotify servers if you brute force api calls.

You should add a short wait period after timing out to reduce chances of getting rate-limited or IP banned.

If you are limited to 1000 req/day, then simply sleep program for 24 hours or stop program (and save current iteration and data context), and run again after you are allowed more requests. See https://developer.spotify.com/documentation/web-api/guides/rate-limits/

python3

import time

time.sleep(60 * 60 * 24)

Add logic to re-try request a few more times every 60 seconds before calling it a day.

CodePudding user response：

You're calling spotify.audio_features(track_id) for every single track, even if you've already fetched its data. Each Friday's results should only introduce a few new songs, but you're re-fetching information on all 200. Don't do that. Make another database table for song info. After you've fetched a track_id's info, write it to the database. Before fetching a track_id's info, see if you've already stored it in the database. Then you'll only be making the bare minimum of necessary API calls, rather than 200 * num_weeks * num_countries.