Delete duplicate files based on modified time, keeping the first created file

Time:07-08

I have video files recorded from a webcam. The recording produces many files, and I want to delete duplicates based on modification time, at minute granularity: files modified within the same minute count as duplicates.

For example, I have the 3 video files below, with modification times given as hour:minute:second.

  1. Ek001.AVI - modification time is 08:30:15
  2. Ek002.AVI - modification time is 08:30:40
  3. Ek003.AVI - modification time is 08:32:55

The files I want to remain are:

  1. Ek001.AVI - modification time is 08:30:15 (the first file created is kept)
  2. Ek003.AVI

So far I have the code below, which finds the modification time.

import os
import glob
from datetime import datetime

for file in glob.glob('C:\\Users\\xxx\\*.AVI'):
    time_mod = os.path.getmtime(file)     
    print (datetime.fromtimestamp(time_mod).strftime('%Y-%m-%d %H:%M:%S'),'-->',file)

Please help me adapt my code to delete duplicate files based on modification time, at minute granularity.

CodePudding user response:

Here is my suggested solution. See the comments in the code itself for a detailed explanation, but the basic idea is that you build up a dictionary of lists of 2-element tuples, where the keys of the dictionary are the number of whole minutes since the start of Unix time, and the 2-tuples contain the filename and the remaining seconds. You then loop over the values of the dictionary (lists of tuples for files modified within the same calendar minute), sort each list by the seconds, and delete all except the first.

The use of a defaultdict here is just a convenience to avoid the need to explicitly add new lists to the dictionary when looping over files, because these will be added automatically when needed.
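
To make the defaultdict point concrete, here is a minimal sketch that groups the three example files from the question by minute. The timestamps are hypothetical seconds-since-midnight values standing in for real `os.path.getmtime` results, and the names `example_mtimes` and `plain` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-ins for os.path.getmtime() results: seconds since
# midnight for the three example files from the question.
example_mtimes = {
    "Ek001.AVI": 8 * 3600 + 30 * 60 + 15,  # 08:30:15
    "Ek002.AVI": 8 * 3600 + 30 * 60 + 40,  # 08:30:40
    "Ek003.AVI": 8 * 3600 + 32 * 60 + 55,  # 08:32:55
}

files_by_minute = defaultdict(list)
for filename, mtime in example_mtimes.items():
    # a missing key gets an empty list automatically, so no explicit
    # "is this minute already in the dict?" check is needed
    files_by_minute[mtime // 60].append((filename, mtime % 60))

# with a plain dict, the same grouping needs setdefault (or an if-check)
plain = {}
for filename, mtime in example_mtimes.items():
    plain.setdefault(mtime // 60, []).append((filename, mtime % 60))
```

Both loops build the same grouping; the defaultdict version just leaves out the bookkeeping.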

import os
import glob
from collections import defaultdict

files_by_minute = defaultdict(list)

# group together all the files according to the number of minutes since the
# start of Unix time, storing the filename and the number of remaining seconds
for filename in glob.glob("C:\\Users\\xxx\\*.AVI"):
    time_mod = os.path.getmtime(filename)
    mins = time_mod // 60
    secs = time_mod % 60
    files_by_minute[mins].append((filename, secs))

# go through each of these lists of files, removing the newer ones if
# there is more than one
for fileset in files_by_minute.values():
    if len(fileset) > 1:
        # sort tuples by second element (i.e. the seconds)
        fileset.sort(key=lambda t:t[1])
        # remove all except the first
        for file_info in fileset[1:]:
            filename = file_info[0]
            print(f"removing {filename}")
            os.remove(filename)

CodePudding user response:

I think you can solve this by using a set. Convert the Unix time (mtime) to integer minutes, then iterate a sorted (ascending order) sequence. If a number is in the set, you already have a file for that minute (delete the file). If not, add the number to the set. Here's how this can look in principle:

# example times written as HMMSS integers (e.g. 83015 is 08:30:15)
ts = [83015, 83145, 83045, 83115]

s = set()
for t in sorted(ts):
    # to minute; note that it would be //60 if using Unix time (seconds)
    mins = t//100
    if mins in s:
        print(f"delete {t}")
    else:
        s.add(mins)

# delete 83045
# delete 83145

In practice, that could look like:

from datetime import datetime
from pathlib import Path

src = Path('...') # insert your path
files = sorted(src.glob('...'), key=lambda p: p.stat().st_mtime) # use your search pattern

s = set()
for f in files:
    mins = int(f.stat().st_mtime)//60
    if mins in s:
        print(f"delete {f}")
        print(datetime.fromtimestamp(f.stat().st_mtime).strftime('%Y-%m-%d %H:%M:%S'))
    else:
        print(f"keep {f}")
        print(datetime.fromtimestamp(f.stat().st_mtime).strftime('%Y-%m-%d %H:%M:%S'))
        s.add(mins)
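
If you want to check the logic before deleting anything, one option is to wrap the set-based loop in a function that only returns the keep and delete lists. The following is a rough sketch under that assumption; `split_by_minute`, the temporary files, and the timestamps are all hypothetical and only mirror the example from the question.

```python
import os
import tempfile
from pathlib import Path


def split_by_minute(paths):
    """Return (keep, delete); the oldest file in each minute is kept."""
    seen_minutes = set()
    keep, delete = [], []
    # iterate oldest first so the first file seen per minute is the one kept
    for p in sorted(paths, key=lambda p: p.stat().st_mtime):
        minute = int(p.stat().st_mtime) // 60
        if minute in seen_minutes:
            delete.append(p)
        else:
            seen_minutes.add(minute)
            keep.append(p)
    return keep, delete


# Demo on throwaway files; names and timestamps are made up.
with tempfile.TemporaryDirectory() as tmp:
    base = 1_700_000_040  # arbitrary Unix time that is a multiple of 60
    for name, offset in [("Ek001.AVI", 15), ("Ek002.AVI", 40), ("Ek003.AVI", 175)]:
        path = Path(tmp) / name
        path.touch()
        os.utime(path, (base + offset, base + offset))  # set atime and mtime
    keep, delete = split_by_minute(Path(tmp).glob("*.AVI"))
    kept_names = [p.name for p in keep]
    deleted_names = [p.name for p in delete]
```

Once the two lists look right, calling `p.unlink()` on each entry of `delete` would perform the actual removal.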

CodePudding user response:

I gave this a try. As I understood it, you want to keep only the latest file. Why would you have to specify the minutes? It is enough to count the time since the last change in seconds.

My code has a lot of comments that hopefully clarify my logic. But roughly:

  • Find all files and calculate the time since the last save
  • Add the filename and time to a dict
  • Find the min value in the dict (= latest file) and delete all other files

Hope this helps.

import os
import time

fileDir = '/path/to/files'
time_dict = {}

# loop through files in dir
for file in sorted(os.listdir(fileDir)):
    # build the full path; os.listdir returns bare names, so getmtime
    # would otherwise look in the current working directory
    path = os.path.join(fileDir, file)

    # find time since last save
    time_since_change = int(time.time() - os.path.getmtime(path))

    # if-statement in case you have your files in the same dir as your code
    if '.py' not in file:
        # save full path & time since last save into dict
        time_dict[path] = time_since_change

# prints dict just to check that I later will delete the correct file
print(time_dict)

# smallest time since last save == latest saved file (None if no files)
newest = min(time_dict.values(), default=None)

# loop through dict and report every file that is not the latest
for k, v in list(time_dict.items()):
    if v != newest:
        print("remove file: ", k, "\t", v)

        # to uncomment when you actually want to test the deletion:
        # os.remove(k)
        # del time_dict[k]
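
As a more compact variant of the same idea, the newest file can be picked directly with `max` keyed on `os.path.getmtime`. This is only a sketch: `all_but_newest` and the demo files are hypothetical, and like the code above it keeps just the single latest file rather than one file per minute.

```python
import os
import tempfile


def all_but_newest(directory):
    """Return full paths of every regular file except the most recently modified."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    paths = [p for p in paths if os.path.isfile(p)]
    if not paths:
        return []
    newest = max(paths, key=os.path.getmtime)
    return sorted(p for p in paths if p != newest)


# Demo on throwaway files; names and timestamps are made up.
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime in [("a.avi", 1_700_000_000),
                        ("b.avi", 1_700_000_100),
                        ("c.avi", 1_700_000_050)]:
        path = os.path.join(tmp, name)
        open(path, "w").close()
        os.utime(path, (mtime, mtime))  # set atime and mtime
    to_remove = [os.path.basename(p) for p in all_but_newest(tmp)]
```

The function only reports candidates; passing each returned path to `os.remove` would do the deletion.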