I am saving a large amount of data from some Monte Carlo simulations. I simulate 20 things over a period of 10 time steps using a varying number of random draws. So, for a given number of random draws, I have a folder with 10 .csv files (one for each time step), each of which has 20 columns of data and n rows, where n is the number of random draws in that simulation. Currently my basic code for loading the data looks something like this:
import pandas as pd
import numpy as np

load_path = r'...\path\to\data'
numScenarios = [100, 500, 1000, 2500, 5000, 10000, 20000]
yearsSimulated = np.arange(1, 11)

for n in numScenarios:
    # One folder per number of random draws (raw strings keep the backslashes literal)
    folder_path = load_path + r'\draws = ' + str(n)
    for year in yearsSimulated:
        # One .csv file per time step
        filename = r'\year ' + str(year) + '.csv'
        path = folder_path + filename
        df = pd.read_csv(path)
        # save df.describe() somewhere
I want to efficiently save df.describe() somehow so that I can compare how the number of random draws affects the results for the 20 things at a given time step. That is, I would ultimately like some object that I can access easily and that stores all the df.describe() outputs for each individual time step. I'm not sure of a nice way to do this, though. Some previous questions seem to suggest that dictionaries may be the way to go here, but I've not been able to get them working.
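One possible way to get such an object, sketched here under assumptions rather than offered as the definitive approach: collect each df.describe() in a dictionary keyed by (year, number of draws), then let pd.concat turn the keys into index levels. The synthetic data, the shortened lists, and the column names thing_1 ... thing_20 are all made up for illustration; in practice df would come from pd.read_csv(path).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
num_scenarios = [100, 500, 1000]                 # shortened list, for illustration only
years = range(1, 4)                              # 3 time steps instead of 10
columns = [f"thing_{k}" for k in range(1, 21)]   # hypothetical column names

summaries = {}
for year in years:
    for n in num_scenarios:
        # Stand-in for pd.read_csv(path): n random draws for 20 things
        df = pd.DataFrame(rng.normal(size=(n, 20)), columns=columns)
        summaries[(year, n)] = df.describe()

# The tuple keys become the two outer index levels; the describe() row
# labels (count, mean, std, ...) form the innermost level.
stats = pd.concat(summaries).rename_axis(["year", "draws", "stat"])

# Compare the mean of every thing across draw counts for time step 2
year2_means = stats.xs((2, "mean"), level=("year", "stat"))
```

Here year2_means has one row per number of draws and one column per thing, so the effect of the draw count at a fixed time step can be read straight down each column.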
Edit:
My final approach is to use an answer to a question here with a bunch of loops. So now I have:
class ngram(dict):
    """Based on perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return super(ngram, self).__getitem__(item)
        except KeyError:
            # Missing key: create and store an empty nested ngram on first access
            value = self[item] = type(self)()
            return value

results = ngram()
for year in yearsSimulated:
    year_str = str(year)
    for n in numScenarios:
        n_str = str(n)
        folder_path = load_path + r'\draws = ' + n_str
        filename = r'\scenarios ' + year_str + '.csv'
        path = folder_path + filename
        df = pd.read_csv(path)
        # Fresh frame for each (year, n) pair so stored results never share state
        ann_stats = pd.DataFrame()
        ann_stats['mean'] = df.mean()
        ann_stats['std. dev'] = df.std()
        ann_stats['1%'] = df.quantile(0.01)
        ann_stats['25%'] = df.quantile(0.25)
        ann_stats['50%'] = df.quantile(0.5)
        ann_stats['75%'] = df.quantile(0.75)
        ann_stats['99%'] = df.quantile(0.99)
        # Transpose so that rows are statistics and columns are the 20 things
        results[year_str][n_str] = ann_stats.T
So now the summary data for each time step and number of draws can be accessed as a DataFrame with
test = results[year_str][n_str]
where the columns of test hold the results for each of my 20 things.
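To compare the effect of the number of draws at a single time step, one statistic can be pulled out of every stored table and stacked into a single DataFrame. The following is a minimal sketch under assumptions: compare_draws is a hypothetical helper name, the plain nested dict stands in for the results object above, and the data, the shortened lists, and the column names are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
num_scenarios = [100, 500, 1000]                # shortened list, for illustration only
columns = [f"thing_{k}" for k in range(1, 4)]   # hypothetical names, 3 things instead of 20

# Stand-in for the nested results object: results[year_str][n_str] -> DataFrame
results = {"1": {}}
for n in num_scenarios:
    df = pd.DataFrame(rng.normal(size=(n, len(columns))), columns=columns)
    results["1"][str(n)] = pd.DataFrame({"mean": df.mean(), "std. dev": df.std()}).T


def compare_draws(results, year_str, stat):
    """Return one row per draw count holding `stat` for every column."""
    rows = {int(n_str): table.loc[stat] for n_str, table in results[year_str].items()}
    out = pd.DataFrame(rows).T.sort_index()
    out.index.name = "draws"
    return out


means_year1 = compare_draws(results, "1", "mean")
```

One thing to watch for with the autovivifying class: looking up a key that does not exist, such as a mistyped year, silently creates an empty entry instead of raising KeyError, so it may be worth checking keys with `in` before indexing.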