Python: How to implement concurrent futures to a function-CodePudding

I was wonder what would be a good way to implement Concurrent Futures to iterate through a large list of stocks for New Program.

On my previous program, I tried using concurrent futures but when printing the data it was not consistent. For example when running a large list of stocks, it will give different information each time(As you can see for Output 1 and 2 for the previous program). I wanted to provide my previous program to see what I did wrong with implementing concurrent futures.

Thanks!

New Program

tickers = ["A","AA","AAC","AACG","AACIU","AADI","AAIC","AAIN","AAL","AAMC","AAME","AAN","AAOI","AAON","AAP","AAPL"]
def create_df(tickers):
    all_info = []
    for ticker in tickers:
        all_info.append(yf.Ticker(ticker).info)
        
    df = pd.DataFrame.from_records(all_info)
    df = df[['symbol','ebitda', 'enterpriseValue', 'trailingPE', 'sector']]
    df.dropna(inplace=True)
    # This is where you can add calculations and other columns not in Yfinance Library
    df['EV/Ratio'] = df['enterpriseValue'] / df['ebitda']
    return df

df = create_df(tickers)
print(df)
print('It took', time.time()-start, 'seconds.')

Output

   symbol        ebitda  enterpriseValue  trailingPE              sector   EV/Ratio
0       A  1.762000e 09     5.311271e 10   60.754720          Healthcare  30.143422
9    AAMC -2.015600e 07     1.971329e 08    1.013164  Financial Services  -9.780359
10   AAME  2.305600e 07     1.175756e 08    7.652329  Financial Services   5.099566
11    AAN  8.132960e 08     1.228469e 09    9.329710   Consumer Cyclical   1.510483
13   AAON  1.178790e 08     3.501286e 09   55.615944         Industrials  29.702376
14    AAP  1.239876e 09     1.609877e 10   25.986680   Consumer Cyclical  12.984181
15   AAPL  1.109350e 11     2.489513e 12   33.715443          Technology  22.441190
It took 101.81006002426147 seconds.

Previous Program For Reference

tickers = ["A","AA","AAC","AACG","AACIU","AADI","AAIC","AAIN","AAL","AAMC","AAME","AAN","AAOI","AAON","AAP","AAPL"]
start = time.time()

col_a = []  
col_b = []  
col_c = []  
col_d = []  

print('Lodaing Data... Please wait for results')


def do_something(tickers):
    print('---', tickers, '---')
    all_info = yf.Ticker(tickers).info
    try:
        a = all_info.get('ebitda')
        b = all_info.get('enterpriseValue')
        c = all_info.get('trailingPE')
        d = all_info.get('sector')
    except:
        None
    col_a.append(a)  
    col_b.append(b)  
    col_c.append(c)  
    col_d.append(d)     
    return
with concurrent.futures.ThreadPoolExecutor() as executer:
    executer.map(do_something, tickers)
        

# Dataframe Set Up
pd.set_option("display.max_rows", None)
   
df = pd.DataFrame({
    'Ticker': tickers,
    'Ebitda': col_a,  
    'EnterpriseValue' :col_b,  
    'PE Ratio': col_c,  
    'Sector': col_d,
})
print(df.dropna())
print(len('Total Companies with Information'))
print('It took', time.time()-start, 'seconds.')

Output 1 for Previous Program

   Ticker        Ebitda  EnterpriseValue   PE Ratio              Sector
1      AA  1.651000e 09     5.031802e 10  49.183292          Healthcare
3    AACG  2.216000e 09     1.168140e 10  11.711775     Basic Materials
5    AADI  1.928800e 07     1.108360e 08   6.954397  Financial Services
7    AAIN  1.128370e 08     3.960835e 09  57.706764         Industrials
8     AAL  8.303301e 08     1.103969e 09   9.111819   Consumer Cyclical
10   AAME  1.202330e 11     2.534678e 12  26.737967          Technology
12   AAOI -1.848400e 07     1.277540e 08   0.355233  Financial Services
14    AAP  1.224954e 09     1.770882e 10  26.059464   Consumer Cyclical
32
It took 4.2548089027404785 seconds.

Output 2 for Previous Program

   Ticker        Ebitda  EnterpriseValue   PE Ratio              Sector
0       A -1.848400e 07     1.277540e 08   0.355233  Financial Services
4   AACIU  1.202330e 11     2.534678e 12  26.737967          Technology
5    AADI  1.651000e 09     5.031802e 10  49.183292          Healthcare
7    AAIN  1.128370e 08     3.960835e 09  57.706764         Industrials
9    AAMC  8.303301e 08     1.103969e 09   9.111819   Consumer Cyclical
10   AAME  2.216000e 09     1.168140e 10  11.711775     Basic Materials
13   AAON  1.224954e 09     1.770882e 10  26.059464   Consumer Cyclical
14    AAP  1.928800e 07     1.108360e 08   6.954397  Financial Services
32
It took 4.003742933273315 seconds.

CodePudding user response：

You have a multithreaded program. The function ThreadPoolExecutor.map launches a number of threads that will run concurrently. Each thread consists of one call to do_something(), but you do not have any control over the order in which these threads execute or finish. The problem occurs because you append the results (a, b, c, d) to the individual lists col_a, col_b etc. inside do_something. Those lists are global, so the data gets appended to them in more-or-less random order. It is even possible that a thread switch occurs right in the middle of the four calls to append(). So the order of the data will be random, and the individual rows might be messed up.

The list of ticker symbols is added to the dataframe in the main thread. So the list of symbols and the data itself are not synchronized. That's exactly what you observe.

The easiest solution is to set up all your data structures in the main thread. This is easy to do because the function map() returns an iterator, and the order of iteration is guaranteed to be preserved. The iterator steps over the values returned by do_something(). So instead of trying to update the lists col_a, col_b, etc. in that function, just return the values a, b, c, d as a tuple. Back in your main thread, you take these values and append them to the columns.

The order of execution of the different threads is not controlled, but map() sorts it out for you; it collects all the results first, and then steps through them in order.

Change this part of your program - everything else can stay the same.

def do_something(tickers):
    print('---', tickers, '---')
    all_info = yf.Ticker(tickers).info
    try:
        a = all_info.get('ebitda')
        b = all_info.get('enterpriseValue')
        c = all_info.get('trailingPE')
        d = all_info.get('sector')
    except:
        return None, None, None, None  # must return a 4-tuple
    return a, b, c, d

with concurrent.futures.ThreadPoolExecutor() as executer:
    for a, b, c, d in executer.map(do_something, tickers):
        col_a.append(a)  
        col_b.append(b)  
        col_c.append(c)  
        col_d.append(d)

CodePudding user response：

Here is the answer on how to implement multithreading to New Function provided by @iudeen

import pandas as pd
import yfinance as yf
from concurrent.futures import ThreadPoolExecutor
import time
from stocks import tickers
start = time.time()



print('Lodaing Data... Please wait for results')
all_info = []
def create_df(ticker):
    all_info.append(yf.Ticker(ticker).info)
    
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(create_df, x) for x in tickers]

df = pd.DataFrame.from_records(all_info)
df = df[['symbol','ebitda', 'enterpriseValue', 'trailingPE', 'sector']]
df.dropna(inplace=True)
df['EV/Ratio'] = df['enterpriseValue'] / df['ebitda']
print(df)
print('It took', time.time()-start, 'seconds.')