Home > Software design >  Python: Sort raw data files before loading
Python: Sort raw data files before loading

Time:08-13

I want to load multiple raw data files (csv). The machine I am extracting the data from writes a time stamp into the file names, e.g. ABC_202211_10_10_10 (10.10 am). The problem is that a file which was created earlier on the same day, for example ABC_202211_9_10_10 (9.10 am), is being loaded after the 10.10 am file. I use the following to load the files.

for files in filenames:
    name_only = Path(files).stem
    df = pd.read_csv(files, skiprows=10, sep=";")

Is there a way to sort the files before reading the csv files, perhaps splitting the file name, sort it according to the time and then load the raw data files based on this newly sorted list? The date is apparently not an issue, e.g. files with 202212 in the name are getting loaded after 202211.

CodePudding user response:

Further to my original comment:

Using the natsort library will sort the filenames naturally, e.g. 9 will come before 10 - rather than a lexical (or alphabetical) sort where 10 (starting with 1) comes before 9.

This will be an easy approach, as you won't have to write your own parser and sorter. No need to re-invent.

For example:

from natsort import natsorted

files = ['ABC_202211_9_10_10', 
         'ABC_202211_8_10_10', 
         'ABC_202211_12_10_10', 
         'ABC_202211_11_10_10', 
         'ABC_202211_10_10_10']

natsorted(files)

Output:

['ABC_202211_8_10_10',
 'ABC_202211_9_10_10',
 'ABC_202211_10_10_10',
 'ABC_202211_11_10_10',
 'ABC_202211_12_10_10']

Further comments:

For completeness, using the built-in sorted function or even the list.sort() function will sort the filenames alphabetically as:

['ABC_202211_10_10_10',
 'ABC_202211_11_10_10',
 'ABC_202211_12_10_10',
 'ABC_202211_8_10_10',
 'ABC_202211_9_10_10']
  • Related