How can I loop through different sample sizes and create a dataframe for each one, so that I can use them in a model?
I attempted this with the following code, but it does not seem to yield correct results. Is there an alternative way to handle the different sample sizes so that they can be passed through a model?
def HiggsData_loader():
    higgs_arr = []
    X_dir2 = {}
    y_dir2 = {}
    sizes = [10000, 50000, 500000, 1000000]
    for s in sizes:
        datasets = pd.read_csv('./DATA/HIGGS.csv', header=None, nrows=s)
        y2 = datasets.values[:, 0]
        X2 = datasets.values[:, 1:]
        scaler = preprocessing.StandardScaler().fit(X2)  # A scaler object
        X_scaled2 = scaler.transform(X2)
        higgs_arr.append('Higgs')
        X_dir2['Higgs'] = X_scaled2.copy()
        y_dir2['Higgs'] = y2.copy()
    return higgs_arr, X_dir2, y_dir2
I was expecting to pass the different samples through the following code to measure the processing time.
md2 = {}

def processing_time(data, methods):
    # Data_arr, X_dir and y_dir are expected to be defined elsewhere
    for m in methods:
        rd = {}
        for ds in Data_arr:
            X = X_dir[ds]
            y = y_dir[ds]
            kNN = KNeighborsClassifier(n_neighbors=50, algorithm=m)
            t_start = time.time()
            scores = cross_val_score(kNN, X, y, cv=2)
            t = time.time() - t_start
            rd[ds] = t
            print('\n', m, " Time: ", '\n', t)
        md2[m] = rd
    return md2
CodePudding user response:
Standard rule: if you use a for-loop, then you need a list to keep all the results.
You should:
- create a list for all results before the loop, i.e. all_results = []
- inside the loop, create new higgs_arr, X_dir2 and y_dir2, add the data, and append all of them to the list, i.e. all_results.append( [higgs_arr, X_dir2, y_dir2] )
- at the end, use return all_results
This way you get a list with one result per sample size.
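To see why the original loader ends up with a single dataset, note that it stores every sample under the same key, 'Higgs', so each iteration replaces the previous one. A minimal sketch with stand-in data (the small sizes and the names overwritten and collected are illustrative only, not the HIGGS data):

```python
sizes = [10, 50, 100]  # small stand-in sizes, not the real sample sizes

# Pattern from the question: one dict created before the loop, one fixed key.
overwritten = {}
for s in sizes:
    overwritten['Higgs'] = list(range(s))  # replaces the previous entry

# Pattern from this answer: a fresh dict per iteration, collected in a list.
collected = []
for s in sizes:
    X_dir2 = {'Higgs': list(range(s))}
    collected.append(X_dir2)

print(len(overwritten), len(overwritten['Higgs']))  # 1 100
print([len(d['Higgs']) for d in collected])         # [10, 50, 100]
```

Only the last sample size survives in the first pattern, while the second keeps all of them.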
I don't know how you use HiggsData_loader() in processing_time, so I don't know what changes it may need. That is why I show only HiggsData_loader().
It could look like this:
import pandas as pd
from sklearn import preprocessing

def HiggsData_loader():
    all_results = []
    sizes = [10000, 50000, 500000, 1000000]
    for s in sizes:
        datasets = pd.read_csv('./DATA/HIGGS.csv', header=None, nrows=s)
        y2 = datasets.values[:, 0]
        X2 = datasets.values[:, 1:]
        scaler = preprocessing.StandardScaler().fit(X2)  # A scaler object
        X_scaled2 = scaler.transform(X2)
        # new containers for every sample size
        higgs_arr = []
        X_dir2 = {}
        y_dir2 = {}
        higgs_arr.append('Higgs')
        X_dir2['Higgs'] = X_scaled2.copy()
        y_dir2['Higgs'] = y2.copy()
        all_results.append([higgs_arr, X_dir2, y_dir2])
    return all_results
And later you can use it as
all_results = HiggsData_loader()
for higgs_arr, X_dir2, y_dir2 in all_results:
    # ... code ...
or directly
for higgs_arr, X_dir2, y_dir2 in HiggsData_loader():
    # ... code ...
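How processing_time should consume these results depends on code not shown in the question, so the following is only a sketch under assumptions: time_per_dataset and run_model are hypothetical names, and the data is a stand-in shaped like the loader's output. It times one callable per returned dataset and labels each timing by dataset name and sample count:

```python
import time

def time_per_dataset(all_results, run_model):
    """Time run_model(X, y) once for each (names, X_dir, y_dir) entry.

    Returns a dict mapping (dataset name, number of samples) to seconds.
    """
    timings = {}
    for names, X_dir, y_dir in all_results:
        for name in names:
            X, y = X_dir[name], y_dir[name]
            t_start = time.perf_counter()
            run_model(X, y)
            timings[(name, len(X))] = time.perf_counter() - t_start
    return timings

# Stand-in data shaped like the output of HiggsData_loader().
fake_results = [
    (['Higgs'], {'Higgs': [[0.0]] * n}, {'Higgs': [0] * n})
    for n in (10, 50)
]

# Stand-in for the real work, e.g. a call to cross_val_score on X and y.
timings = time_per_dataset(fake_results, lambda X, y: sum(y))
print(sorted(timings))  # [('Higgs', 10), ('Higgs', 50)]
```

Including the sample count in the key keeps the timings for different sizes apart even though every dataset is named 'Higgs'.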
EDIT:
If you use HiggsData_loader() directly in a for-loop, then you could use the original version, but with yield instead of return, placed inside the loop. This turns the function into a generator that produces one result per sample size.
def HiggsData_loader():
    sizes = [10000, 50000, 500000, 1000000]
    for s in sizes:
        datasets = pd.read_csv('./DATA/HIGGS.csv', header=None, nrows=s)
        y2 = datasets.values[:, 0]
        X2 = datasets.values[:, 1:]
        scaler = preprocessing.StandardScaler().fit(X2)  # A scaler object
        X_scaled2 = scaler.transform(X2)
        higgs_arr = []
        X_dir2 = {}
        y_dir2 = {}
        higgs_arr.append('Higgs')
        X_dir2['Higgs'] = X_scaled2.copy()
        y_dir2['Higgs'] = y2.copy()
        yield higgs_arr, X_dir2, y_dir2  # inside loop
And then you can run it as
for higgs_arr, X_dir2, y_dir2 in HiggsData_loader():
    # ... code ...
or you may need list() to get all values at once
all_results = list( HiggsData_loader() )
for higgs_arr, X_dir2, y_dir2 in all_results:
    # ... code ...
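The practical difference between the two versions is when the work happens: a generator runs the loop body only when the next item is requested, while list() runs everything up front. A small sketch with a stand-in loader (fake_loader and log are illustrative names, not part of the original code):

```python
log = []

def fake_loader(sizes):
    """Stand-in for HiggsData_loader(): yields one 'dataset' per size."""
    for s in sizes:
        log.append(f'loaded {s}')  # records when the loading step actually runs
        yield list(range(s))

gen = fake_loader([10, 50])
print(log)        # [] because nothing runs until iteration starts

first = next(gen)
print(log)        # ['loaded 10']

rest = list(gen)  # consumes the remaining items
print(log)        # ['loaded 10', 'loaded 50']
```

A generator can be consumed only once, so keep the list, or call the function again, if the data is needed more than once.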