Group by to count occurrences of unique ID/itemName across multiple CSV files and sum up value in an


I'm currently trying to do the following:

  1. Count how many times each "id"/"itemName" pair occurs across the different csv files in mypath (all files are formatted the same [see sampleCSVFile below]; id and itemName are unique within each file, but may or may not repeat in other files)
  2. Sum the values in the "Entries" column for a given id/itemName across all the files.
  3. Get the highest/lowest number in the "Entries" column across all the files for a given id/itemName

sampleCSVFile in my mypath:

id    itemName    NonImportantEntries    Entries    SomeOtherEntries
1      item1              27              111             163
2      item2              16               22              98

My Code so far:

import pandas as pd

from os import listdir
from os.path import isfile, join

#readMyDirectoryForFiles
mypath= <myDirectoryPath>
   
list_of_files = [f for f in listdir(mypath) if isfile(join(mypath, f))]

#GenerateAllFilesInDirList
list_of_files = [join(mypath, f) for f in list_of_files]

#LoadAllFilesToPandasDFs
dataframes = [pd.read_csv(fi) for fi in list_of_files]

#count id/itemName occurrences across all files, sum the "Entries" column,
#and record the highest/lowest "Entries" value per id/itemName for the
#output file -> this step is really hard for me.

#tried the below but received -> AttributeError: 'list' object has no attribute 'groupby';
#also not sure where to start or how to capture the highest/lowest "Entries"
#value across all the files.
count_occurences = dataframes.groupby(['id', 'itemName']).sum()['entries']



...
#save file in the following format
df.to_csv('finalFile.csv', index = False)

Expected finalFile.csv format:

    id  itemName    noOfOccurencessInFiles  TotalEntriesFromAllFiles    HighestEntriesNumber    LowestEntriesNumber
    1     item1            11                       3500                       1200                      0

CodePudding user response:

Use concat to combine the frames, then groupby with as_index=False so id and itemName stay as columns:

final = pd.concat(dataframes).groupby(['id', 'itemName'], as_index=False)['Entries'].sum()

final.to_csv('finalFile.csv', index = False)
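
The snippet above only covers the sum. To also get the occurrence count and the highest/lowest "Entries" value in one pass, the same concat-then-groupby pattern can be extended with named aggregation. A sketch, assuming the per-file frames are already loaded into `dataframes` with the column names from the sample (the two small frames below just stand in for the CSV files):

```python
import pandas as pd

# Stand-in frames for the per-file CSVs (id/itemName unique within each file).
dataframes = [
    pd.DataFrame({'id': [1, 2], 'itemName': ['item1', 'item2'],
                  'Entries': [111, 22]}),
    pd.DataFrame({'id': [1], 'itemName': ['item1'],
                  'Entries': [50]}),
]

final = (
    pd.concat(dataframes)
      .groupby(['id', 'itemName'], as_index=False)
      .agg(noOfOccurencessInFiles=('Entries', 'size'),
           TotalEntriesFromAllFiles=('Entries', 'sum'),
           HighestEntriesNumber=('Entries', 'max'),
           LowestEntriesNumber=('Entries', 'min'))
)
final.to_csv('finalFile.csv', index=False)
```

Note that `'size'` counts rows per group, which equals the number of files an id/itemName appears in only because each pair is unique within a file, as stated in the question.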