Home > Blockchain >  Pandas sum counts from list of excel files and save to new file
Pandas sum counts from list of excel files and save to new file

Time:06-12

I have a list of excel files that follow the below structure (toy):

enter image description here

data1 = [["","","0:15","during the phase of that time","during","","cat1"],["","","","","the","","cat1"],["","","","","phase","","cat2"],["","","","","of","","cat1"],["","","","","that","","cat1"],["","","","","time","","cat3"]]
data2 = [["1","1","0:19","at what point is enough enough tommy","at","","cat1"],["","","","","what","","cat2"],["","","","","point","","cat2"],["","","","","is","","cat2"],["","","","","enough","","cat1"],["","","","","enough","","cat1"],["","","","","tommy","","cat3"]]
df1 = pd.DataFrame(data1, columns=['Intro', 'Speaker','time','transcript','word','motiv','category'])
df2 = pd.DataFrame(data2, columns=['Intro', 'Speaker','time','transcript','word','motiv','category'])
df_list = [df1,df2]
names = ["file1","file2"]

for i in range(len(df_list)):
    filename = 'Filename'  names[i]   '.xlsx'
    writer = ExcelWriter(filename)
    print(filename)
    df_list[i].to_excel(writer, 'Sheet1', index=False)
    writer.save()

My goal with this list (over 200 df's) is to get summary data of specific columns and store that information into another aggregated excel document. In this toy dataset, the summary example would include a count of how many times cat1, cat2, and cat3 appeared in the "category" column.

My final output in an excel file called "results.xlsx" would look something like this: enter image description here

I have a very limited knowledge of pandas, and my code so far has been a frankenstein of other projects I've worked on that worked with lists of dataframes. So far, I have:

#Get directory of excel files (using df_list for toy dataset)
os.chdir("path")
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.xlsx"))
#Set aside list
li = []
#Create aggregate dataframe
frame = pd.DataFrame()

for f in csv_files:
    
    # read the csv file
    df = pd.read_excel(f).clean_names()
    #Set filename variable
    frame['filename'] = os.path.basename(f)
    #Set category 1 counts variable
    frame['category1'] = df.iloc[:,6].value_counts()
    #Append aggregated results to aggregate dataframe
    li.append(frame)
#Concatenate results
frame2 = pd.concat(li)

However, the results I'm getting do not at all match what I am expecting when I apply it to my actual data. Specifically, I am getting an index with the actual variable I am counting (category 1), only the very last excel file in the list of files is provided in the result, and my category 1 variable is empty: enter image description here This repeats continuously for the length of number of files in my list.

I have no clue what I am doing wrong here, any tips that could steer me in the right direction?

CodePudding user response:

Here is one approach:

from pathlib import Path

files = Path('.').glob('*.xlsx')
pd.DataFrame(pd.read_excel(f)['category'].value_counts().rename(f.stem) for f in files)

               cat1  cat2  cat3
Filenamefile1     4     1     1
Filenamefile2     3     3     1
...
  • Related