Looping through different sample sizes

Time:03-23

How can I loop through different sample sizes and create a dataset for each one, so that I can pass each of them to a model?
I tried the following code, but it does not seem to give correct results. Is there another way to handle the different sample sizes so that they can be passed through a model?

def HiggsData_loader():
    higgs_arr = []
    X_dir2 = {}
    y_dir2 = {}
    sizes = [10000, 50000, 500000, 1000000]
    for s in  sizes:
        datasets =  pd.read_csv('./DATA/HIGGS.csv',header=None,nrows=s)
        y2 = datasets.values[:,0]
        X2 = datasets.values[:,1:]
    
        scaler = preprocessing.StandardScaler().fit(X2) #A scaler object
        X_scaled2 = scaler.transform(X2)

        higgs_arr.append('Higgs')
        X_dir2['Higgs'] = X_scaled2.copy()
        y_dir2['Higgs'] = y2.copy()
    
    return higgs_arr, X_dir2, y_dir2

I was expecting to pass the different samples through the following code to measure the processing time.

md2 = {}
def processing_time(data,methods):
    for m in models:
        rd = {}
        for ds in Data_arr:
            X = X_dir[ds]
            y = y_dir[ds]
            kNN =  KNeighborsClassifier(n_neighbors=50, algorithm = m)
            t_start = time.time()
            scores = cross_val_score(kNN, X, y, cv=2)
            t = time.time()-t_start
            rd[ds] = t
            print('\n', m, " Time: ", '\n', t)
        md2[m] = rd
    return md2

CodePudding user response:

General rule: if you produce a result in each iteration of a for-loop, you need a list to keep all of the results.

You should:

  • create a list for all results before the loop, e.g. all_results = []
  • inside the loop, create new higgs_arr, X_dir2, y_dir2, fill them with data, and append all three to the list, e.g. all_results.append( [higgs_arr, X_dir2, y_dir2] )
  • at the end, use return all_results

This way you get a list with one entry per sample size.
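
The original loader writes to the same 'Higgs' key on every pass, so each iteration replaces what the previous one stored and only the last sample size survives. A minimal sketch of that effect, using small made-up lists in place of the scaled arrays (the sizes and the `fake_load` helper are illustrative only, not from the question):

```python
def fake_load(n):
    """Hypothetical stand-in for pd.read_csv(..., nrows=n)."""
    return list(range(n))

sizes = [3, 5, 8]

# Same key every time: each iteration overwrites the previous value.
overwritten = {}
for s in sizes:
    overwritten['Higgs'] = fake_load(s)

# One new dict per size, appended to a list: nothing is lost.
all_results = []
for s in sizes:
    all_results.append({'Higgs': fake_load(s)})

print(len(overwritten['Higgs']))               # 8
print([len(r['Higgs']) for r in all_results])  # [3, 5, 8]
```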

I don't know how you use HiggsData_loader() in processing_time, so I can't say what changes it may need there. I show only HiggsData_loader().

It could look like this:

def HiggsData_loader():

    all_results = []

    sizes = [10000, 50000, 500000, 1000000]

    for s in  sizes:
        datasets =  pd.read_csv('./DATA/HIGGS.csv',header=None,nrows=s)
        y2 = datasets.values[:,0]
        X2 = datasets.values[:,1:]
    
        scaler = preprocessing.StandardScaler().fit(X2) #A scaler object
        X_scaled2 = scaler.transform(X2)

        higgs_arr = []
        X_dir2 = {}
        y_dir2 = {}

        higgs_arr.append('Higgs')
        X_dir2['Higgs'] = X_scaled2.copy()
        y_dir2['Higgs'] = y2.copy()
            
        all_results.append( [higgs_arr, X_dir2, y_dir2] )
    
    return all_results

Later you can use it like this:

all_results = HiggsData_loader()

for higgs_arr, X_dir2, y_dir2 in all_results:
    # ... code ...

or directly

for higgs_arr, X_dir2, y_dir2 in HiggsData_loader():
    # ... code ...
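
To connect this to the timing code in the question, one option is to loop over the returned list and record one timing per sample size. The sketch below is only an assumption about how that could look: `fit_and_score` is a hypothetical placeholder where the `cross_val_score` call would go, `time_each` is an illustrative name, and the datasets are tiny fake lists so the example runs without HIGGS.csv.

```python
import time

def fit_and_score(X, y):
    """Hypothetical placeholder for cross_val_score(kNN, X, y, cv=2)."""
    return sum(1 for xi, yi in zip(X, y) if (xi % 2) == yi) / len(y)

def time_each(all_results, func):
    """Time func on every dataset; key results by (name, number of rows)."""
    timings = {}
    for names, X_dir, y_dir in all_results:
        for name in names:
            X, y = X_dir[name], y_dir[name]
            t_start = time.perf_counter()
            func(X, y)
            timings[(name, len(y))] = time.perf_counter() - t_start
    return timings

# Fake loader output in the same shape that HiggsData_loader() returns.
all_results = [
    (['Higgs'], {'Higgs': list(range(n))}, {'Higgs': [i % 2 for i in range(n)]})
    for n in (100, 1000)
]

timings = time_each(all_results, fit_and_score)
for (name, n_rows), seconds in timings.items():
    print(f"{name} n={n_rows}: {seconds:.6f} s")
```

Keying the timings by the number of rows keeps the sample sizes apart even though every dataset has the same name.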

EDIT:

If you use HiggsData_loader() directly in a for-loop, you could keep the original version but use yield instead of return, placed inside the loop:

def HiggsData_loader():
    sizes = [10000, 50000, 500000, 1000000]
    for s in  sizes:
        datasets =  pd.read_csv('./DATA/HIGGS.csv',header=None,nrows=s)
        y2 = datasets.values[:,0]
        X2 = datasets.values[:,1:]
    
        scaler = preprocessing.StandardScaler().fit(X2) #A scaler object
        X_scaled2 = scaler.transform(X2)

        higgs_arr = []
        X_dir2 = {}
        y_dir2 = {}

        higgs_arr.append('Higgs')
        X_dir2['Higgs'] = X_scaled2.copy()
        y_dir2['Higgs'] = y2.copy()
    
        yield higgs_arr, X_dir2, y_dir2  # inside loop

Then you can run it as

for higgs_arr, X_dir2, y_dir2 in HiggsData_loader():
    # ... code ...

or, if you need all values at once, wrap it in list()

all_results = list( HiggsData_loader() )

for higgs_arr, X_dir2, y_dir2 in all_results:
    # ... code ...
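
The practical difference between the two forms is when the work happens: a generator runs the loop body only when the next item is requested, while list() runs all of it up front. A small sketch with a made-up `loader` generator (not the Higgs loader) that logs each load:

```python
log = []

def loader(sizes):
    """Hypothetical generator: 'loads' one sample size per iteration."""
    for s in sizes:
        log.append(s)              # record when this size is actually loaded
        yield s, list(range(s))

gen = loader([2, 4, 6])
# Nothing has run yet; the body starts on the first next() call.
loaded_before = list(log)

first_size, first_rows = next(gen)
loaded_after_first = list(log)

remaining = list(gen)              # consumes the rest in one go
print(loaded_before, loaded_after_first, log)
```

A generator can be iterated only once, so wrap it in list() if you need to loop over the same results several times.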