How to loop through corresponding columns in a csv

Time:01-29

I'm trying to write a Python script that reads all the .csv files in a folder. Every .csv file contains 94 columns. I would like to loop through the files and headers so that it takes the first column of every file and plots a single histogram containing the data from all of those first columns, then plots another single histogram containing only the data from the 2nd columns, then another for the 3rd columns, and so on. In total it should produce 94 histograms.

I currently have code which loops a bit differently: it goes to the first file, then plots a histogram for each header in that file, then moves on to the next file, plots a histogram for each header in that file etc. Below is part of the code that does that.

dfs = []
for iteration, file in enumerate(files):
    _dfs = pd.read_csv(file)
    dfs.append(_dfs)
    print('Data is', round(100*((iteration + 1)/len(files)), 0), '% loaded') #Prints how much data has been loaded so far.


'''-----------------------------------
Plotting Graphs
--------------------------------------
'''
for i in range(len(dfs)): #loops through files
    for k in dfs[i]: #loops through column headers
        plt.hist(dfs[i][k], 25)
        plt.title(files[i][22:]) #uses filename as title
        plt.xlabel(dfs[i][k].name) #uses column header for x-label
        plt.ylabel('Frequency Density')
        plt.show()

files is simply a list containing the names of all the files, and dfs is the list of dataframes read from them. How can I alter my script to achieve what I described at the beginning?

CodePudding user response:

If I understand you correctly, you can change the second for loop to iterate over each dataframe's columns attribute and use the enumerate function to keep track of the current column number. You can then use that column number to create a separate histogram for each column.

for i in range(len(dfs)): #loops through files
    for j, col in enumerate(dfs[i].columns): #loops through columns
        plt.hist(dfs[i][col], 25)
        plt.title(files[i][22:]) #uses filename as title
        plt.xlabel(col) #uses column header for x-label
        plt.ylabel('Frequency Density')
        plt.show()

I hope it helps!

CodePudding user response:

This produces 94 histograms, each one aggregating a single column's data across all dataframes.

#######################
### Plotting Graphs ###
#######################

for i in range(94):
    data = [] # store all i'th column data across all dfs
    for df in dfs:
        data.extend(list(df.iloc[:,i])) # i'th column
    
    plt.hist(data, bins=25)
    plt.title(dfs[0].iloc[:,i].name) # get name of column from 1st df
    plt.xlabel(dfs[0].iloc[:,i].name) # get name of column from 1st df
    plt.ylabel('Frequency Density')
    plt.show()
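
The same column-first idea can also be sketched with pd.concat, which avoids building a Python list by hand and reads the column count from the data instead of hard-coding 94. This is only a sketch under some assumptions: every file has the same columns in the same order, the columns are numeric, and the names pool_column and plot_pooled_histograms are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def pool_column(dfs, position):
    """Return the values at one column position from every DataFrame as one Series."""
    # iloc selects by position, so the column is matched by its order in the file.
    return pd.concat([df.iloc[:, position] for df in dfs], ignore_index=True)


def plot_pooled_histograms(dfs, bins=25):
    """Draw one histogram per column position, pooling that column across all files."""
    n_columns = dfs[0].shape[1]  # assumes every file has the same number of columns
    for position in range(n_columns):
        pooled = pool_column(dfs, position).dropna()  # drop missing values before plotting
        name = dfs[0].columns[position]  # header taken from the first file
        plt.hist(pooled, bins=bins)
        plt.title(name)
        plt.xlabel(name)
        plt.ylabel('Frequency')
        plt.show()
```

With the dfs list built by the loading loop in the question, calling plot_pooled_histograms(dfs) should show one figure per column.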