Improving the runtime of a pandas loop-CodePudding

I am actively running some Python code in jupyter on a df consisting of about 84k rows. I'm estimating this is going to take somewhere in the neighborhood of 9 hours at this rate. My code is below, I have read that ideally one would vectorize for max speed but being sort of new to Python and coding in general, I'm not sure how I can go about changing the below code to vectorize it. The goal is to look at the value in the first column of the dataframe and add that value to the end of a url. I then check the first line in the url and compare it to some predetermined values to find out if there is a match. Any advice would be greatly appreciated!

#Python 3

import pandas as pd
import urllib

no_res = "Item Not Found"
error = "Does Not Exist"

for i in df1.index:

   path = 'http://xxxx/xxx/xxx.pl?part='   str(df1['ITEM_ID'][i])
   parsed_path = path.replace(' ',' ')
   f = urllib.request.urlopen(parsed_path)
   raw = str(f.read().decode("utf-8"))
   lines = raw.split('\n')
   r = lines[0]

   if r == no_res:
      sap = 'NO'
   elif r == error:
      sap = 'ERROR'
   else:
      sap = 'YES'

df1["item exists"][i] = sap
df1["Path"][i] = path
df1["URL return value"][i] = r

Edit adding test code below

import concurrent.futures
import pandas as pd
import urllib
import numpy as np

def my_func(df_row):
    no_res = "random"
    error = "entered"
    path = "http://www.google.com"
    parsed_path = path.replace(' ',' ')
    f = urllib.request.urlopen(parsed_path)
    raw = str(f.read().decode("utf-8"))
    lines = raw.split('\n')
    r = df_row['0']

    if r == no_res:
        sap = "NO"
    elif r == error:
        sap = "ERROR"
    else:
        sap = "YES"

    df_row['4'] = sap
    df_row['5'] = lines[0]
    df_row['6'] = r
   
n = 1000
my_df = pd.DataFrame(np.random.choice(['random','words','entered'], size=(n,3)))
my_df['4'] = ""
my_df['5'] = ""
my_df['6'] = ""
my_df = my_df.apply(lambda col: col.astype('category'))

executor = concurrent.futures.ProcessPoolExecutor(8)
futures = [executor.submit(my_func, row) for _,row in my_df.iterrows()]
concurrent.futures.wait(futures)

This is throwing the following error (shortened):

DoneAndNotDoneFutures(done={<Future at 0x1cfe4938040 state=finished raised BrokenProcessPool>, <Future at 0x1cfe48b8040 state=finished raised BrokenProcessPool>,

CodePudding user response：

Since you are doing some outside operation with a URL, I do not think vectorization is a solution (let possible).

The bottleneck of your operation is the following line

f = urllib.request.urlopen(parsed_path)

This line waits for the response and is blocking, as mentioned your operation is I/O bound. The CPU can start other jobs while waiting for the response. The solution to address this is using concurrency.

Edit: My original answer was using python built-in multi threading which was problematic. The best way to do multiprocessing/threading with pandas data frame is using "dask" library.

The following code is tested with the dummy data set on my PC and on average speeds up the naive for loop by ~ 12 times.

#%%
import time
import urllib.request

import pandas as pd
import numpy as np

import dask.dataframe as dd

def my_func(df_row):
    df_row = df_row.copy()
    no_res = "random"
    error = "entered"
    path = "http://www.google.com"
    parsed_path = path.replace(' ',' ')
    f = urllib.request.urlopen(parsed_path)
    # I had to change the encoding on my machine.
    raw = str(f.read().decode("'windows-1252")) 
    lines = raw.split('\n')
    r = df_row[0]

    if r == no_res:
        sap = "NO"
    elif r == error:
        sap = "ERROR"
    else:
        sap = "YES"

    df_row['4'] = sap
    df_row['5'] = lines[0]
    df_row['6'] = r
    return df_row

def run():
    print("started.")
    n = 1000
    my_df = pd.DataFrame(np.random.choice(['random','words','entered'], size=(n,3)))
    my_df = my_df.apply(lambda col: col.astype('category'))
    my_df['4'] = ""
    my_df['5'] = ""
    my_df['6'] = ""

    # Literally dask partitions the original dataframe into
    # npartitions chunks and use them in apply function
    # in parallel.
    my_ddf = dd.from_pandas(my_df, npartitions=15)
    start = time.time()
    q = my_ddf.apply(my_func, axis= 1, meta=my_ddf)
    # num_workers is number of threads used,
    print(q.compute(num_workers= 50))
    time_end = time.time()
    print(f"Elapsed: {time_end - start:10.2f}")

if __name__ == "__main__":
    run()

dask provides many other tools and options to facilitate concurrent processing and it would be a good idea to take a look at its documentation to investigate other options.

P.S. : if you run the above code too many times on google you will receive "HTTP Error 429: Too Many Requests". This happens to prevent something like DDoS attack on a public server. So, if for your real job you are querying a public website, you may end up receiving the same 429 response, if you try 84K queries in a short time.

CodePudding user response：

You are looping to get URL every time meaning 84k times. Put it before the for loop