I want to load multiple raw data files (csv). The machine I am extracting the data from writes a time stamp into the file names, e.g. ABC_202211_10_10_10 (10.10 am). The problem is that a file which was created earlier on the same day, for example ABC_202211_9_10_10 (9.10 am), is being loaded after the 10.10 am file. I use the following to load the files.
for files in filenames:
name_only = Path(files).stem
df = pd.read_csv(files, skiprows=10, sep=";")
Is there a way to sort the files before reading the csv files, perhaps splitting the file name, sort it according to the time and then load the raw data files based on this newly sorted list? The date is apparently not an issue, e.g. files with 202212 in the name are getting loaded after 202211.
CodePudding user response:
Further to my original comment:
Using the natsort
library will sort the filenames naturally, e.g. 9 will come before 10 - rather than a lexical (or alphabetical) sort where 10 (starting with 1) comes before 9.
This will be an easy approach, as you won't have to write your own parser and sorter. No need to re-invent.
For example:
from natsort import natsorted
files = ['ABC_202211_9_10_10',
'ABC_202211_8_10_10',
'ABC_202211_12_10_10',
'ABC_202211_11_10_10',
'ABC_202211_10_10_10']
natsorted(files)
Output:
['ABC_202211_8_10_10',
'ABC_202211_9_10_10',
'ABC_202211_10_10_10',
'ABC_202211_11_10_10',
'ABC_202211_12_10_10']
Further comments:
For completeness, using the built-in sorted
function or even the list.sort()
function will sort the filenames alphabetically as:
['ABC_202211_10_10_10',
'ABC_202211_11_10_10',
'ABC_202211_12_10_10',
'ABC_202211_8_10_10',
'ABC_202211_9_10_10']