Pandas import csv files from folder selecting older Modification date


I have a folder with hundreds of .csv files. I scan the folder to collect the file paths and then import the files one by one in a loop to extract specific data from them. There are duplicates, which I want to remove by keeping only the oldest files.

The Creation and Modification dates do not match the dates I see in Windows 10 Explorer; instead they all show the same date, which seems to correspond to some last change on the server root. Why can I not see the Modification date? Without this key piece I cannot continue with the code as described. Any help is welcome!

import os, glob, time
import pandas as pd

# use glob to get all the csv files in the folder
csv_files = glob.glob(os.path.join(path_input, format))
output = pd.DataFrame(csv_files)
output.rename(columns={output.columns[0]: "file path"}, inplace=True)
for i in range(len(output)):
    output['Creation date'] = time.ctime(os.path.getctime(output['file path'][i]))  # creation date
    output['Modif. date'] = time.ctime(os.path.getmtime(output['file path'][i]))    # modification date

(Screenshot: the 'Date modified' column of the csv files I need to retrieve.)

What I get instead is this:

file path                                                      Creation date            Modif. date  
0    S:\Production\Verspaning\Grootverspaning\Meetr...  Mon Jul 19 15:21:19 2021  Mon Jul 19 15:21:19 2021
1    S:\Production\Verspaning\Grootverspaning\Meetr...  Mon Jul 19 15:21:19 2021  Mon Jul 19 15:21:19 2021
2    S:\Production\Verspaning\Grootverspaning\Meetr...  Mon Jul 19 15:21:19 2021  Mon Jul 19 15:21:19 2021
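One thing worth double-checking in the loop above: `output['Creation date'] = ...` assigns a single scalar to the entire column on every iteration, so after the loop every row shows the timestamps of the last file processed, which would explain the identical dates. A minimal sketch of a per-row assignment with `.loc` (the temporary directory and file names below are made up for illustration, standing in for `path_input`):

```python
import os, glob, time, tempfile
import pandas as pd

# Illustrative stand-in for path_input: a temp dir with two csv files
tmp = tempfile.mkdtemp()
for name, mtime in [("a.csv", 1_000_000_000), ("b.csv", 1_100_000_000)]:
    p = os.path.join(tmp, name)
    with open(p, "w") as f:
        f.write("x\n1\n")
    os.utime(p, (mtime, mtime))  # force distinct modification times

csv_files = glob.glob(os.path.join(tmp, "*.csv"))
output = pd.DataFrame({"file path": csv_files})

# .loc[i, col] writes one cell; a bare output[col] = ... would overwrite the whole column
for i in range(len(output)):
    output.loc[i, "Creation date"] = time.ctime(os.path.getctime(output["file path"][i]))
    output.loc[i, "Modif. date"] = time.ctime(os.path.getmtime(output["file path"][i]))

print(output["Modif. date"].nunique())  # the two files now show different dates
```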

CodePudding user response:

How about iterating over the files as simply as you can and making sure you get the expected results, as a start?

I also added size so you can verify that at least differently-sized files show correctly:

import os, glob

ctime = os.path.getctime
mtime = os.path.getmtime
size = os.path.getsize

csv_files = glob.glob(os.path.join('.', '*.csv'))

print('File,C-Time,M-Time,Size (B)')
for csvf in csv_files: 
    print(f'{csvf},{ctime(csvf):.2f},{mtime(csvf):.2f},{size(csvf)}')

I'm running this on a directory of 1000 files with nearly identical creation times, and differences show up even at the hundredths-of-a-second level:

| File      | C-Time        | M-Time        | Size (B) |
|-----------|---------------|---------------|----------|
| ./545.csv | 1638252246.15 | 1638252246.15 | 89       |
| ./223.csv | 1638252245.86 | 1638252245.86 | 89       |
| ./237.csv | 1638252245.87 | 1638252245.87 | 89       |
| ./551.csv | 1638252246.15 | 1638252246.15 | 89       |
| ./579.csv | 1638252246.17 | 1638252246.17 | 89       |
| ./586.csv | 1638252246.18 | 1638252246.18 | 89       |
| ./592.csv | 1638252246.19 | 1638252246.19 | 89       |
| ./747.csv | 1638252246.33 | 1638252246.33 | 89       |
| ./753.csv | 1638252246.34 | 1638252246.34 | 89       |
| ./784.csv | 1638252246.36 | 1638252246.36 | 89       |
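Once the per-file timestamps look sane, picking the oldest copy of each duplicate can be sketched with a plain min() over mtime. This assumes duplicates share a basename; the directory layout below is made up for illustration:

```python
import os, glob, tempfile
from collections import defaultdict

# Illustrative setup: two sub-dirs each holding a file with the same basename
root = tempfile.mkdtemp()
for sub, mtime in [("old", 1_000), ("new", 2_000)]:
    d = os.path.join(root, sub)
    os.makedirs(d)
    p = os.path.join(d, "dup.csv")
    with open(p, "w") as f:
        f.write("x\n1\n")
    os.utime(p, (mtime, mtime))  # force distinct modification times

# Group paths by basename, then keep the copy with the smallest mtime (the oldest)
groups = defaultdict(list)
for p in glob.glob(os.path.join(root, "*", "*.csv")):
    groups[os.path.basename(p)].append(p)

oldest = {name: min(paths, key=os.path.getmtime) for name, paths in groups.items()}
print(oldest["dup.csv"])  # the path under the "old" sub-dir
```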

CodePudding user response:

Pretty simple using pathlib + pandas:

import pandas as pd
from pathlib import Path

trg_dir = 'path/to/your/dir'

df = pd.concat([pd.read_csv(d).assign(pth=d) for d in Path(trg_dir).glob('*.csv')])

As we've assigned a Path object to the column, we can access its underlying attributes, such as the modification time.

df['modify_dt'] = df['pth'].apply(lambda x: pd.Timestamp(x.stat().st_mtime, unit='s'))

Imagine we had an output as follows.


pth is your key to de-dupe by.

   A                        pth                     modify_dt
0  1  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
1  2  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
2  3  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
0  1  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
1  2  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
2  3  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
0  1  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
1  2  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
2  3  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
0  1  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
1  2  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
2  3  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
0  1  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
1  2  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
2  3  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
0  1  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
1  2  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
2  3  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224


df.sort_values('modify_dt',ascending=False).drop_duplicates(subset='pth',keep='first')

   A                        pth                     modify_dt
0  1  /home/umarh/test/file.csv 2021-11-30 22:19:36.764982224
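Note that sorting with ascending=False and keep='first' retains the *newest* row per path; since the question asks for the oldest files, sort ascending instead (or use keep='last'). A minimal sketch on a made-up frame mirroring the structure above:

```python
import pandas as pd

# Made-up frame: two rows for the same path with different modification times
df = pd.DataFrame({
    "A": [1, 2],
    "pth": ["file.csv", "file.csv"],
    "modify_dt": pd.to_datetime(["2021-11-30", "2021-01-01"]),
})

# Ascending sort + keep='first' retains the oldest row per path
oldest = df.sort_values("modify_dt").drop_duplicates(subset="pth", keep="first")
print(oldest["A"].tolist())  # [2]
```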