I have a list containing 12 Pandas dataframes. Each dataframe has 94 columns. How can I create a loop which creates 94 new dataframes? The first new dataframe (lets call it df1 for simplicity) should have dfs[1]['column 1']
in its first column, then dfs[2]['column 1']
in its second column, then dfs[3]['column 1']
in its third column. The second new dataframe (df2) should have dfs[1]['column 2']
in its first column, then dfs[2]['column 2']
in its second column etc..
This means I can then plot each column of df1 onto a single histogram, then each column of df2 onto a new single histogram etc, such that I would be able to plot 94 histograms at the end (I can do this myself by simply appending each new dataframe into a list, then doing a nested for-loop through the dataframes and column headers)
Right now, my code produces the right number of histogram plots (94) but instead of producing new pandas dataframes, it puts all the data in a list. This means that on each plot, I can't differentiate between the different histograms on the same plot as they are all the same colour.
for i in range(94):
data = [] # store all i'th column data across all dfs
for df in dfs:
data.extend(list(df.iloc[:,i])) # i'th column
plt.hist(data, bins=50, histtype = 'step', label = dfs[0].iloc[:,i].name)
plt.title(dfs[0].iloc[:,i].name) # get name of column from 1st df
plt.xlabel(dfs[0].iloc[:,i].name) # get name of column from 1st df
plt.ylabel('Frequency Density')
plt.show()
dfs
is the list containing all of the original 12 dataframes.
CodePudding user response:
You might think that to save memory it would be a good idea to avoid building all 94 new dataframes simultaneously. However, Pandas does not have to make copies of the underlying data (unless you want it to) so the new dataframes can simply contain views into the data from original 12. Here's how to do that:
new_dfs = {}
columns = range(len(dfs)) # specify column names
for col in dfs[0].columns:
new_df = pd.concat([df[col] for df in dfs], axis=1,
keys=columns, copy=False)
new_dfs[col] = new_df
Update
I added the keyword argument copy=False
to pd.concat
because by default it does make a copy of the data.
Here's the code to plot the new data
for name, df in new_dfs.items():
for col in df:
plt.hist(df[col], bins=50, histtype='step', label=col)
plt.title(name) # name of column
plt.xlabel(name)
plt.ylabel('Frequency Density')
plt.legend()
plt.show()
CodePudding user response:
I came up with a better answer than my previous one. You can concatenate all the dataframes with a multi-index and use the multiindex to get a view of all the columns with the same name.
df_index = range(len(dfs))
df_combined = pd.concat(dfs, axis=1, keys= df_index, copy=False)
columns = df_combined.columns.levels[1].tolist()
idx = pd.IndexSlice
for col in columns:
df = df_combined.loc[:, idx[:, col]] # select all columns matching col
df.columns = df.columns.droplevel(1)
for name in df:
plt.hist(df[name], bins=50, histtype='step', label=name)
plt.title(col) # name of column
plt.xlabel(col)
plt.ylabel('Frequency Density')
plt.legend()
plt.show()