python multithreading while waiting for API call


Detailed: I'm using Python to make a small app that scrapes a top 100 song list and creates a Spotify playlist from it. I'm bottlenecked by the fact that the Spotify API only lets you search for one song at a time (to get its internal Spotify ID).

Short: I tried multithreading with mixed results.

For reference, this is what __search_song does (not entirely relevant):

def __search_song(self, song: str):
    result = self.sp.search(song + " NOT Karaoke", limit=1, type="track")
    try:
        sid = result["tracks"]["items"][0]["uri"]
    except IndexError:
        pass
    else:
        self.song_list.append(sid)

Initial implementation:

def __populate_playlist(self, song_list: list, pid: str):
    for song in song_list:
        self.__search_song(song)

    self.sp.playlist_add_items(pid, self.song_list)

This was normal "one after another" execution. It worked fine, but it was slow, and it made the window hang because Tkinter's UI loop needs to refresh constantly.
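The hang, for what it's worth, is just the Tk mainloop being blocked. A minimal sketch (with hypothetical names) of the usual fix: run the slow work on a background thread so the mainloop keeps redrawing.

import threading
import tkinter as tk

def build_playlist():
    ...  # the slow scraping and searching happens here
    # (any widget updates should be scheduled back on the main thread,
    # e.g. via root.after, since Tkinter is not thread-safe)

def on_build_clicked():
    # Hand the slow work to a daemon thread; mainloop() keeps running.
    threading.Thread(target=build_playlist, daemon=True).start()

root = tk.Tk()
tk.Button(root, text="Build playlist", command=on_build_clicked).pack()
root.mainloop()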

Multithreading using threading and queue:

q = queue.Queue()


def __worker():
    while True:
        item = q.get()
        q.task_done()


threading.Thread(target=__worker, daemon=True).start()

def __populate_playlist(self, song_list: list, pid: str):
    for song in song_list:
        q.put(self.__search_song(song))

    q.join()
    self.sp.playlist_add_items(pid, self.song_list)

This worked; however, it was only marginally faster than the original. It did fix the issue of the program appearing unresponsive, but it was not fast enough.

I then tried to drop the queue and implement unordered threading.

def __populate_playlist(self, song_list: list, pid: str):
    # one thread per song (threading, not multiprocessing)
    threads = []
    for song in song_list:
        t = threading.Thread(target=self.__search_song, args=(song, ))
        threads.append(t)

    for thread in threads:
        thread.start()

    for thread in threads:
        thread.join()

    self.sp.playlist_add_items(pid, self.song_list)

and this was very fast: I'm talking a reduction from 23 to 8 seconds. Obviously this has the unintended consequence that the playlist is shuffled, since each thread appends to self.song_list in whatever order it finishes, so it's not a real top 100 anymore.

My question is simple: is there an issue with my implementation of the queue, or does a queue system inherently add this much overhead? This is the first time I've ever implemented multithreading in an application, so I might be missing something.

To reiterate the use case: I don't really care which search finishes first, as long as the final order is maintained. I thought about storing the initial order of the list and using a dictionary to map each song to its Spotify ID, but I'm still thinking about the actual implementation of that.
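For reference, the standard library's concurrent.futures thread pool sidesteps the ordering problem entirely: ThreadPoolExecutor.map returns results in input order, no matter which thread finishes first. A minimal sketch, assuming a search function that returns the track URI (or None) instead of appending to a shared list:

from concurrent.futures import ThreadPoolExecutor

def populate_playlist(sp, search_song, song_list, pid):
    # map() yields results in the order of song_list, regardless of
    # which search completes first; the worker count is illustrative.
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(search_song, song_list))
    ids = [sid for sid in results if sid]  # drop songs with no match
    sp.playlist_add_items(pid, ids)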

CodePudding user response:

As mentioned, it's difficult to guarantee an order when the calls are asynchronous. But a simple implementation that maps each song name to its ID would be:

def __search_song(self, song: str):
    result = self.sp.search(song + " NOT Karaoke", limit=1, type="track")
    try:
        sid = result["tracks"]["items"][0]["uri"]
    except IndexError:
        pass
    else:
        self.song_list.append(sid)
        self.song_to_sid[song] = sid

This assumes the dict song_to_sid is instantiated in your class. If you then iterate over your original list (which is in order), you can append each mapped sid to build an ordered playlist.

After you have run the __populate_playlist function you can do:

top_hundred_playlist = []
for song in song_list:  # the original, ordered list of song names
    top_hundred_playlist.append(self.song_to_sid[song])

CodePudding user response:

The final code, with the help of Albin Sidås, ended up looking like this:

def __search_song(self, song: str):
    """Searches for a song by name and stores its Spotify URI in song_to_sid."""
    result = self.sp.search(song + " NOT Karaoke", limit=1, type="track")
    try:
        sid = result["tracks"]["items"][0]["uri"]
    except IndexError:
        self.song_to_sid[song] = ""
    else:
        self.song_to_sid[song] = sid

def __populate_playlist(self, song_list: list, pid: str):
    # one thread per song (threading, not multiprocessing)
    top_hundred_ids = []
    threads = []
    for song in song_list:
        self.song_list_names.append(song)
        t = threading.Thread(target=self.__search_song, args=(song, ))
        threads.append(t)

    for thread in threads:
        thread.start()

    for thread in threads:
        thread.join()

    for song in self.song_list_names:
        sid = self.song_to_sid[song]
        if sid != "":
            top_hundred_ids.append(sid)

    self.sp.playlist_add_items(pid, top_hundred_ids)

It ended up being just a second slower than the unordered threaded solution, so I consider this the winning approach. I'm still open to any clarification on the overhead of a queue system, but all in all, this is great.
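For what it's worth, the slowness of the queue version probably wasn't queue overhead at all: q.put(self.__search_song(song)) calls the search immediately on the producer thread and only enqueues its return value (None), so the searches still ran one after another. A sketch (with illustrative names and an arbitrary worker count of 8) of the same pattern with the work moved into the workers:

import queue
import threading

q = queue.Queue()

def worker(search_song):
    while True:
        song = q.get()            # block until a song name is available
        try:
            search_song(song)     # the network call happens in the worker
        finally:
            q.task_done()         # mark this item as processed

def populate_playlist(search_song, song_list):
    # Start a small pool of daemon worker threads.
    for _ in range(8):
        threading.Thread(target=worker, args=(search_song,), daemon=True).start()
    for song in song_list:
        q.put(song)               # enqueue the task itself, not its result
    q.join()                      # wait until every song has been processed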
