I have a folder with hundreds of .csv files. I scan the folder to collect the file paths and then import them one by one in a loop to extract specific data from each. There are duplicates, which I want to remove by keeping only the oldest files.
The creation and modification dates do not match the dates I see in Windows 10 Explorer; instead every file shows the same date, which seems to come from the last root/variation of the server. Why can't I see the modification date? Without this key piece of information I cannot continue with the code as described. Any help is welcome!
import os, glob, time
import pandas as pd

# use glob to get all the csv files in the folder
csv_files = glob.glob(os.path.join(path_input, format))  # format is the glob pattern, e.g. '*.csv'
output = pd.DataFrame(csv_files)
output.rename(columns={output.columns[0]: "file path"}, inplace=True)
for i in range(0, len(output)):
    output['Creation date'] = time.ctime(os.path.getctime(output['file path'][i]))  # creation date
    output['Modif. date'] = time.ctime(os.path.getmtime(output['file path'][i]))    # modification date
I need to retrieve the 'Date modified' of the csv files, but what I get is this:
file path Creation date Modif. date
0 S:\Production\Verspaning\Grootverspaning\Meetr... Mon Jul 19 15:21:19 2021 Mon Jul 19 15:21:19 2021
1 S:\Production\Verspaning\Grootverspaning\Meetr... Mon Jul 19 15:21:19 2021 Mon Jul 19 15:21:19 2021
2 S:\Production\Verspaning\Grootverspaning\Meetr... Mon Jul 19 15:21:19 2021 Mon Jul 19 15:21:19 2021
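A side observation on the code above (not from the original post): `output['Creation date'] = ...` inside the loop assigns a single scalar to the entire column on every iteration, so after the loop every row carries the last file's timestamps, which by itself produces identical dates on every row. A minimal sketch with per-row `.loc` assignment instead, using throwaway temp files in place of the real paths:

```python
import os
import tempfile
import time

import pandas as pd

# Two throwaway files so the stat calls have something to read
# (illustrative paths, standing in for the real S:\ share).
tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, name) for name in ("a.csv", "b.csv")]
for p in paths:
    with open(p, "w") as f:
        f.write("x")

output = pd.DataFrame({"file path": paths})

# Assign per row with .loc so each file keeps its own timestamps;
# a bare column assignment inside the loop would overwrite the
# whole column with one scalar on every pass.
for i in range(len(output)):
    output.loc[i, "Creation date"] = time.ctime(os.path.getctime(output["file path"][i]))
    output.loc[i, "Modif. date"] = time.ctime(os.path.getmtime(output["file path"][i]))

print(output)
```

A vectorized `output['file path'].map(os.path.getmtime)` would avoid the loop entirely.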
CodePudding user response:
How about just iterating the files as simply as you can and making sure you get expected results, as a start?
I also added size so you can verify that at least differently-sized files show up correctly:
import os, glob

ctime = os.path.getctime
mtime = os.path.getmtime
size = os.path.getsize

csv_files = glob.glob(os.path.join('.', '*.csv'))

print('File,C-Time,M-Time,Size (B)')
for csvf in csv_files:
    print(f'{csvf},{ctime(csvf):.2f},{mtime(csvf):.2f},{size(csvf)}')
I'm running this on a directory with 1000 files with nearly identical creation times, and I can see differences of just hundredths of a second:
| File | C-Time | M-Time | Size (B) |
|-----------|---------------|---------------|----------|
| ./545.csv | 1638252246.15 | 1638252246.15 | 89 |
| ./223.csv | 1638252245.86 | 1638252245.86 | 89 |
| ./237.csv | 1638252245.87 | 1638252245.87 | 89 |
| ./551.csv | 1638252246.15 | 1638252246.15 | 89 |
| ./579.csv | 1638252246.17 | 1638252246.17 | 89 |
| ./586.csv | 1638252246.18 | 1638252246.18 | 89 |
| ./592.csv | 1638252246.19 | 1638252246.19 | 89 |
| ./747.csv | 1638252246.33 | 1638252246.33 | 89 |
| ./753.csv | 1638252246.34 | 1638252246.34 | 89 |
| ./784.csv | 1638252246.36 | 1638252246.36 | 89 |
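Building on that, once the per-file times look right, keeping only the oldest file per duplicate group could be sketched as follows. This assumes a content hash defines "duplicate" (an assumption; swap in whatever actually identifies duplicates in your data), and the temp files and names are illustrative:

```python
import hashlib
import os
import tempfile

def file_digest(path):
    # Content hash as the duplicate key -- an assumption; replace
    # with whatever actually defines "duplicate" for your files.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# Three throwaway files: a.csv and b.csv have the same content
# (duplicates), with b.csv the older of the two.
tmp = tempfile.mkdtemp()
for name, body, mt in [("a.csv", "same", 200), ("b.csv", "same", 100), ("c.csv", "diff", 300)]:
    p = os.path.join(tmp, name)
    with open(p, "w") as f:
        f.write(body)
    os.utime(p, (mt, mt))  # pin access/modification times for the demo

oldest = {}
for name in sorted(os.listdir(tmp)):
    full = os.path.join(tmp, name)
    key = file_digest(full)
    # Keep the file with the smallest (oldest) modification time.
    if key not in oldest or os.path.getmtime(full) < os.path.getmtime(oldest[key]):
        oldest[key] = full

print(sorted(os.path.basename(p) for p in oldest.values()))  # → ['b.csv', 'c.csv']
```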
CodePudding user response:
This is pretty simple using pathlib and pandas:
from pathlib import Path

import pandas as pd

trg_dir = 'path/to/your/dir'
df = pd.concat([pd.read_csv(d).assign(pth=d) for d in Path(trg_dir).glob('*.csv')])
As we've assigned a Path object to the column, we can access its underlying attributes, such as the modified time.
df['modify_dt'] = df['pth'].apply(lambda x: pd.Timestamp(x.stat().st_mtime, unit='s'))
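For reference, a self-contained sketch of that stat-to-Timestamp conversion (the temp file and pinned mtime are purely illustrative):

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative temp file with a pinned mtime (2021-11-30 19:39:36 UTC).
p = Path(tempfile.mkstemp(suffix=".csv")[1])
os.utime(p, (1638301176, 1638301176))

# st_mtime is epoch seconds; unit='s' converts it to a pandas Timestamp.
ts = pd.Timestamp(p.stat().st_mtime, unit="s")
print(ts)  # → 2021-11-30 19:39:36
```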
Imagine we had an output as follows; pth is your key to de-dupe by.
A pth modify_dt
0 1 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
1 2 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
2 3 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
0 1 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
1 2 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
2 3 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
0 1 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
1 2 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
2 3 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
0 1 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
1 2 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
2 3 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
0 1 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
1 2 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
2 3 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
0 1 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
1 2 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
2 3 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
df.sort_values('modify_dt',ascending=False).drop_duplicates(subset='pth',keep='first')
A pth modify_dt
0 1 /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
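Note that the question asked for the oldest file per duplicate, while sorting descending with keep='first' retains the newest. A minimal sketch with toy data (hypothetical values) keeping the oldest instead:

```python
import pandas as pd

# Toy frame standing in for the concatenated df above; f.csv
# appears twice with different modification times.
df = pd.DataFrame({
    "A": [1, 2, 3],
    "pth": ["f.csv", "f.csv", "g.csv"],
    "modify_dt": pd.to_datetime([
        "2021-11-30 10:00", "2021-11-29 09:00", "2021-11-30 11:00",
    ]),
})

# Ascending sort + keep='first' retains the OLDEST copy per path.
oldest = df.sort_values("modify_dt").drop_duplicates(subset="pth", keep="first")
print(oldest)
```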