I'm currently trying to do the following:
- Count the number of times each "id"/"itemName" pair occurs across the different CSV files in mypath (all files are formatted the same [see sampleCSVFile below]; id and name are unique within each file, but may or may not repeat in other files)
- Sum the values in the "Entries" column for a given id/itemName across all the files.
- Get the highest/lowest number in the "Entries" column across all the files for a given id/itemName.
sampleCSVFile in mypath:
id  itemName  NonImportantEntries  Entries  SomeOtherEntries
1   item1     27                   111      163
2   item2     16                   22       98
My code so far:
import pandas as pd
from os import listdir
from os.path import isfile, join
# read the directory for files
mypath = "<myDirectoryPath>"
list_of_files = [f for f in listdir(mypath) if isfile(join(mypath, f))]
# turn each file name into a full path
list_of_files = [join(mypath, x) for x in list_of_files]
# load all files into pandas DataFrames
dataframes = [pd.read_csv(fi) for fi in list_of_files]
# count id/itemName occurrences across all files and sum the values in the "Entries"
# column, but also find the highest/lowest value in "Entries" and record it for the
# output file -> this step is really hard for me.
# I tried the below but received -> AttributeError: 'list' object has no attribute 'groupby'.
# I'm also not sure where to start or how to proceed with capturing the highest/lowest
# "Entries" number across all the files.
count_occurences = dataframes.groupby(['id', 'itemName']).sum()['entries']
...
# save the result in the following format
df.to_csv('finalFile.csv', index = False)
Expected finalFile.csv format:
id  itemName  noOfOccurrencesInFiles  TotalEntriesFromAllFiles  HighestEntriesNumber  LowestEntriesNumber
1   item1     11                      3500                      1200                  0
CodePudding user response:
Use pd.concat to stack the list of DataFrames into a single one (a plain Python list has no groupby, which is what raised your AttributeError), then group by the id and itemName columns with as_index=False so they stay regular columns. Note the column is "Entries", not "entries":

final = pd.concat(dataframes).groupby(['id', 'itemName'], as_index=False)['Entries'].sum()
final.to_csv('finalFile.csv', index=False)
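
That gives only the total. Here is a minimal sketch of the full expected output (occurrence count, total, highest and lowest) using pandas named aggregation on the same concatenated frame; it reuses the dataframes list from the question, and the output column names are simply copied from your expected finalFile.csv:

# one row per id/itemName pair; 'count' works as the number of files the pair
# appears in, because id/itemName is unique within each individual file
final = (
    pd.concat(dataframes)
      .groupby(['id', 'itemName'], as_index=False)
      .agg(
          noOfOccurrencesInFiles=('Entries', 'count'),
          TotalEntriesFromAllFiles=('Entries', 'sum'),
          HighestEntriesNumber=('Entries', 'max'),
          LowestEntriesNumber=('Entries', 'min'),
      )
)
final.to_csv('finalFile.csv', index=False)

Named aggregation needs pandas 0.25 or newer. One caveat: 'count' skips NaN values, so a file with an empty "Entries" cell for a pair would not add to its occurrence count.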