Given a list of filenames, group filenames with all derivatives based on similarity of the filename

I might not be searching with the best terms, but so far nothing I've found has solved my problem, and I really don't know where to start or even which mechanisms to investigate.

I have a large list of image files in various locations on my hard drive and I'm trying to clean it up by removing the duplicates. Most of these are easy to find using hash codes, but I have a lot of corrupted or edited versions which aren't so easy to find. I know I'll need some user interaction to identify and delete (archive) the unwanted files. I'll also be doing some further processing to make sure metadata such as dates and geotagging are correct (this is also used to potentially match files), and then displaying similar images with all known data through a simple HTML interface.

One of the steps I've identified is grouping similarly named files, or files whose names contain part of another filename. Sometimes these can be completely unrelated, so user interaction will be required.

Below is a sample of files. I would like to group them by similar filename, disregarding path and file extension.

[
"/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335.jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(4).png",
"/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg",
"/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg",
"/Users/stu/Photos/2013/IMAG0097.jpg",
"/Users/stu/Photos/2014/IMAG0097.jpg",
"/Users/stu/Photos/2013/IMAG0126.jpg",
"/Users/stu/Photos/Holidays/IMAG0132.jpg",
"/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg",
"/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg",
"/Users/stu/Downloads/Photos/IMG_20140412_195105.png",
"/Users/stu/Photos/2014/IMG_20140412_195110.png",
"/Users/stu/Photos/2014/IMG_20140413_143245(6).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(7).png",
"/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245.png",
"/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png",
"/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg",
"/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg",
"/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png",
"/Users/stu/Photos/2013/IMG_20140413_072335.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg",
"/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg",
"/Users/stu/Photos/2013/IMAG0126-edited.jpg",
"/Users/stu/Photos/2013/IMAG0126546.jpg"
]

The list of files above should output something like this:

{
    "IMG_20140413_072335": [
        "/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg",
        "/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg",
        "/Users/stu/Photos/2014/IMG_20140413_072335.jpg",
        "/Users/stu/Documents/Backup/IMG_20140413_072335(4).png",
        "/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg",
        "/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg",
        "/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg",
        "/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg",
        "/Users/stu/Photos/2013/IMG_20140413_072335.jpg",
        "/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg",
        "/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg",
        "/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg"
    ],
    "IMAG0097": [
        "/Users/stu/Photos/2013/IMAG0097.jpg",
        "/Users/stu/Photos/2014/IMAG0097.jpg"
    ],
    "IMAG0126": [
        "/Users/stu/Photos/2013/IMAG0126.jpg",
        "/Users/stu/Photos/2013/IMAG0126-edited.jpg",
        "/Users/stu/Photos/2013/IMAG0126546.jpg"
    ],
    "IMAG0132": [
        "/Users/stu/Photos/Holidays/IMAG0132.jpg"
    ],
    "IMG_20140322_142557": [
        "/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg"
    ],
    "IMG_20140330_200132": [
        "/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg"
    ],
    "IMG_20140412_195105": [
        "/Users/stu/Downloads/Photos/IMG_20140412_195105.png"
    ],
    "IMG_20140412_195110": [
        "/Users/stu/Photos/2014/IMG_20140412_195110.png"
    ],
    "IMG_20140413_143245": [
        "/Users/stu/Photos/2014/IMG_20140413_143245(6).png",
        "/Users/stu/Photos/2014/IMG_20140413_143245(7).png",
        "/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg",
        "/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg",
        "/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg",
        "/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg",
        "/Users/stu/Photos/2014/IMG_20140413_143245.png",
        "/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg",
        "/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg",
        "/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg",
        "/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg",
        "/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png",
        "/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg",
        "/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg",
        "/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg",
        "/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg",
        "/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png"
    ]
}

Any ideas how to do this in Python 3?

Thanks

Edit: I just added a few more examples to the sample set of filenames.

CodePudding user response：

The following worked for me:

import os
from pprint import pprint

# t is the list of file paths from the question
d = dict()

for i in t:
    tmp = os.path.basename(i).split(".")[0] # if the file has an extension, take the name before the first "."
                                            # otherwise keep the base name unchanged

    k = tmp.split("(")[0]                   # "(n)" is the typical Windows suffix for copies with the same name;
                                            # if present, take the part before it

    d.setdefault(k,[])                      # create an empty list the first time a key is seen
    if k in tmp:                            # always true here, since k is a prefix of tmp
        d[k].append(i)

# SENTINEL
if sum([len(i) for i in d.values()]) !=len(t):
    raise ValueError("The sanity check wasn't successful !")

pprint(d)

RESULT:

{'IMAG0097': ['/Users/stu/Photos/2013/IMAG0097.jpg',
              '/Users/stu/Photos/2014/IMAG0097.jpg'],
 'IMAG0126': ['/Users/stu/Photos/2013/IMAG0126.jpg'],
 'IMAG0132': ['/Users/stu/Photos/Holidays/IMAG0132.jpg'],
 'IMG_20140322_142557-edited': ['/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg'],
 'IMG_20140330_200132': ['/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg'],
 'IMG_20140412_195105': ['/Users/stu/Downloads/Photos/IMG_20140412_195105.png'],
 'IMG_20140412_195110': ['/Users/stu/Photos/2014/IMG_20140412_195110.png'],
 'IMG_20140413_072335': ['/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg',
                         '/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg',
                         '/Users/stu/Photos/2014/IMG_20140413_072335.jpg',
                         '/Users/stu/Documents/Backup/IMG_20140413_072335(4).png',
                         '/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg',
                         '/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg',
                         '/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg',
                         '/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg'],
 'IMG_20140413_143245': ['/Users/stu/Photos/2014/IMG_20140413_143245(6).png',
                         '/Users/stu/Photos/2014/IMG_20140413_143245(7).png',
                         '/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg',
                         '/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg',
                         '/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg',
                         '/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg',
                         '/Users/stu/Photos/2014/IMG_20140413_143245.png',
                         '/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg',
                         '/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg',
                         '/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg',
                         '/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg',
                         '/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png',
                         '/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg',
                         '/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg',
                         '/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg',
                         '/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg',
                         '/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png']}
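
Note that the split on "(" leaves tags such as "-edited" in the key, which is why 'IMG_20140322_142557-edited' shows up as its own group in the result above. A minimal sketch of one way to handle that, assuming the only decorations are a trailing "(n)" counter and an "-edited" tag, is to strip them with a regular expression. The names base_key and group_by_key and the pattern itself are illustrative assumptions, not part of the code above.

```python
import os
import re
from collections import defaultdict

# Assumed decorations: an optional "(n)" counter, then an optional
# "-edited" tag, both at the very end of the name (hypothetical pattern).
_SUFFIX = re.compile(r"(\(\d+\))?(-edited)?$")

def base_key(path):
    """Return the file name without directory, extension, counter or tag."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return _SUFFIX.sub("", stem)

def group_by_key(paths):
    """Group full paths under their stripped base name."""
    groups = defaultdict(list)
    for p in paths:
        groups[base_key(p)].append(p)
    return dict(groups)
```

Other suffixes, such as "_01" or "-9237", are left in the key under these assumptions and would need their own rule.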

CodePudding user response：

It appears that your pictures are identified by what is between the last 'G' (from 'IMG' or 'IMAG') and the next '.' or '(' or '-'.

Using that portion of the strings as a key, we can easily group filenames into a dict of lists.

files = ['/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335.jpg', '/Users/stu/Documents/Backup/IMG_20140413_072335(4).png', '/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg', '/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg', '/Users/stu/Photos/2013/IMAG0097.jpg', '/Users/stu/Photos/2014/IMAG0097.jpg', '/Users/stu/Photos/2013/IMAG0126.jpg', '/Users/stu/Photos/Holidays/IMAG0132.jpg', '/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg', '/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg', '/Users/stu/Downloads/Photos/IMG_20140412_195105.png', '/Users/stu/Photos/2014/IMG_20140412_195110.png', '/Users/stu/Photos/2014/IMG_20140413_143245(6).png', '/Users/stu/Photos/2014/IMG_20140413_143245(7).png', '/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245.png', '/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png', '/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg', '/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg', '/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png', '/Users/stu/Photos/2013/IMG_20140413_072335.jpg', '/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg', '/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg', 
'/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg', '/Users/stu/Photos/2013/IMAG0126-edited.jpg', '/Users/stu/Photos/2013/IMAG0126546.jpg']

def photo_id(filename):
    i = filename.rfind('G') + 1
    j1 = filename.find('.', i)
    j2 = filename.find('(', i)
    j3 = filename.find('-', i)
    j = min(j for j in (j1,j2,j3,len(filename)) if j > -1)
    return filename[i:j]

photos = {}
for filename in files:
    photos.setdefault(photo_id(filename), []).append(filename)

print(photos)
# {'_20140413_072335': ['/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335.jpg', '/Users/stu/Documents/Backup/IMG_20140413_072335(4).png', '/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg', '/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg', '/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg', '/Users/stu/Photos/2013/IMG_20140413_072335.jpg', '/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg'],
#  '0097': ['/Users/stu/Photos/2013/IMAG0097.jpg', '/Users/stu/Photos/2014/IMAG0097.jpg'],
#  '0126': ['/Users/stu/Photos/2013/IMAG0126.jpg', '/Users/stu/Photos/2013/IMAG0126-edited.jpg'],
#  '0132': ['/Users/stu/Photos/Holidays/IMAG0132.jpg'],
#  '_20140322_142557': ['/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg'],
#  '_20140330_200132': ['/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg'],
#  '_20140412_195105': ['/Users/stu/Downloads/Photos/IMG_20140412_195105.png'],
#  '_20140412_195110': ['/Users/stu/Photos/2014/IMG_20140412_195110.png'],
#  '_20140413_143245': ['/Users/stu/Photos/2014/IMG_20140413_143245(6).png', '/Users/stu/Photos/2014/IMG_20140413_143245(7).png', '/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245.png', '/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg', '/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png', '/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg', '/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg', '/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg', '/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png'],
#  '_20140413_072335_01': ['/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg'],
#  '_20140413_072335_9352': ['/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg'],
#  '0126546': ['/Users/stu/Photos/2013/IMAG0126546.jpg']}
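
The keys above drop the 'IMG' / 'IMAG' prefix because the slice starts after the last 'G'. If the keys should keep the prefix, as in the output the question asks for, one possible variant is to work on the base name and cut at the first '.', '(' or '-'. This is only a sketch: photo_key is a hypothetical name, and it assumes the same three delimiters as photo_id.

```python
import os
import re

def photo_key(path):
    """Base name up to the first '.', '(' or '-' (assumed delimiters)."""
    name = os.path.basename(path)
    match = re.match(r"[^.(\-]+", name)
    # Fall back to the whole name if it starts with a delimiter.
    return match.group(0) if match else name

def group_photos(paths):
    """Group full paths under the key computed by photo_key."""
    photos = {}
    for p in paths:
        photos.setdefault(photo_key(p), []).append(p)
    return photos
```

Suffixes joined with an underscore, such as "_01", still end up in the key, just as they do with photo_id.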

CodePudding user response：

So I worked out a solution that gives me what I'm after. I'm not sure it's the best way to solve this, but it certainly does the job.

First, I created a dict with the full path as the key and the filename minus its extension as the value. This is then sorted by value length so that, as I iterate, I start with the shorter values and work up. For each entry I check all the later (longer) entries for one whose name contains it, and group the matches together. To stop very short filenames from matching everything, I also compare the lengths of the two values and only match if their ratio reaches a threshold (0.5 in the example below).

import os
from pprint import pprint

files = [
    "/Users/stu/Photos/2014/IMG_20140413_072335(2).jpg",
    "/Users/stu/Photos/2014/IMG_20140413_072335(3).jpg",
    "/Users/stu/Photos/2014/IMG_20140413_072335.jpg",
    "/Users/stu/Documents/Backup/IMG_20140413_072335(4).png",
    "/Users/stu/Documents/Backup/IMG_20140413_072335(5).jpg",
    "/Users/stu/Documents/Backup/IMG_20140413_072335(6).jpg",
    "/Users/stu/Photos/2014/IMG_20140413_072335(7).jpg",
    "/Users/stu/Photos/2014/IMG_20140413_072335(1).jpg",
    "/Users/stu/Photos/2013/IMAG0097.jpg",
    "/Users/stu/Photos/2014/IMAG0097.jpg",
    "/Users/stu/Photos/2013/IMAG0126.jpg",
    "/Users/stu/Photos/Holidays/IMAG0132.jpg",
    "/Users/stu/Photos/2014/IMG_20140322_142557-edited.jpg",
    "/Users/stu/Downloads/Photos/IMG_20140330_200132.jpg",
    "/Users/stu/Downloads/Photos/IMG_20140412_195105.png",
    "/Users/stu/Photos/2014/IMG_20140412_195110.png",
    "/Users/stu/Photos/2014/IMG_20140413_143245(6).png",
    "/Users/stu/Photos/2014/IMG_20140413_143245(7).png",
    "/Users/stu/Photos/2014/IMG_20140413_143245(1).jpg",
    "/Users/stu/Downloads/Photos/IMG_20140413_143245(2).jpg",
    "/Users/stu/Downloads/Photos/IMG_20140413_143245(11).jpg",
    "/Users/stu/Photos/2014/IMG_20140413_143245(10).jpg",
    "/Users/stu/Photos/2014/IMG_20140413_143245.png",
    "/Users/stu/Photos/2014/IMG_20140413_143245(3).jpg",
    "/Users/stu/Photos/2014/IMG_20140413_143245(8).jpg",
    "/Users/stu/Photos/2014/IMG_20140413_143245(4).jpg",
    "/Users/stu/Downloads/Photos/IMG_20140413_143245(5).jpg",
    "/Users/stu/Downloads/Photos/IMG_20140413_143245(9).png",
    "/Users/stu/Downloads/Photos/IMG_20140413_143245(3)-edited.jpg",
    "/Users/stu/Photos/2014/IMG_20140413_143245(8)-edited.jpg",
    "/Users/stu/Photos/Holidays/IMG_20140413_143245(4)-edited.jpg",
    "/Users/stu/Photos/Holidays/IMG_20140413_143245(5)-edited.jpg",
    "/Users/stu/Photos/2014/IMG_20140413_143245(9)-edited.png",
    "/Users/stu/Photos/2013/IMG_20140413_072335.jpg",
    "/Users/stu/Photos/2014/IMG_20140413_072335_01.jpg",
    "/Users/stu/Photos/2014/IMG_20140413_072335_9352.jpg",
    "/Users/stu/Photos/2014/IMG_20140413_072335-9237.jpg",
    "/Users/stu/Photos/2013/IMAG0126-edited.jpg",
    "/Users/stu/Photos/20.png",
    "/Users/stu/Photos/203.png",
    "/Users/stu/Photos/2021.png",
    "/Users/stu/Photos/2021q.png",
    "/User/2.jpg"
]

relevanceFactor = 0.5

rawFiles = {}

for file in files:
    rawFiles[file] = os.path.splitext(os.path.basename(file))[0]

sortedFiles = sorted(rawFiles.items(), key=lambda kv: (len(kv[1]), kv[0]))

alreadyGrouped = []
groupedFiles = {}

for i, file in enumerate(sortedFiles):
    fullPath = file[0]
    cluster = file[1]
    if cluster not in alreadyGrouped:
        groupedFiles[cluster] = [fullPath]
        for compareFile in sortedFiles[i + 1:]:
            compareFullPath = compareFile[0]
            compareCluster = compareFile[1]
            if len(cluster)/len(compareCluster) < relevanceFactor:
                break
            if (compareCluster not in alreadyGrouped
                and cluster in compareCluster):
                alreadyGrouped.append(compareCluster)
                groupedFiles[cluster].append(compareFullPath)
        if cluster not in alreadyGrouped:
            alreadyGrouped.append(cluster)

pprint(groupedFiles)
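
One thing to watch with the loop above: alreadyGrouped stores filename stems rather than full paths, so when three or more files in different folders share exactly the same stem, only the first two appear to get grouped. A possible variant, sketched here under the assumption that every path has a non-empty file name, tracks full paths in a set and wraps the logic in a function. The names group_by_containment and relevance are made up for this illustration.

```python
import os

def group_by_containment(paths, relevance=0.5):
    """Group paths whose stem contains a shorter (or equal) stem."""
    # Pair each path with its stem, shortest stems first.
    ordered = sorted(
        ((p, os.path.splitext(os.path.basename(p))[0]) for p in paths),
        key=lambda item: (len(item[1]), item[0]),
    )
    grouped = set()   # full paths already assigned to a group
    groups = {}
    for i, (path, stem) in enumerate(ordered):
        if path in grouped:
            continue
        grouped.add(path)
        groups[stem] = [path]
        for other_path, other_stem in ordered[i + 1:]:
            # Stems only get longer from here on, so stop once the
            # length ratio drops below the threshold.
            if len(stem) / len(other_stem) < relevance:
                break
            if other_path not in grouped and stem in other_stem:
                grouped.add(other_path)
                groups[stem].append(other_path)
    return groups
```

Calling pprint(group_by_containment(files)) with the list above should print the same kind of dict as groupedFiles.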