This is a tough problem. I have around 1,000 images stored on my disk, and I want to find images that are similar to each other by comparing them in pairs. So I have to do around 1,000 * 999 / 2 = 499,500 comparisons (the property of "being similar" is not transitive). My problem is not how to compare the images, but how to manage the memory of my machine efficiently during the comparisons. I have already implemented the comparison function:
static bool AreSimilar(ImageInfo x, ImageInfo y)
{
    // Logic
}
...where ImageInfo
is a class that holds the information for one image:
class ImageInfo : IDisposable
{
    public string Path { get; init; }
    public System.Drawing.Image Image { get; init; }
    public void Dispose() => Image.Dispose();
}
Ideally I would like to load all 1,000 images in memory, and then do a nested loop and invoke the AreSimilar
method for each pair, but the memory needed for loading all of them at once far exceeds the available memory of my machine. The image files are quite large, and their size varies considerably (most of them are between 5 and 50 MB). The available RAM is 2 GB, so I can't have more than ~80 images loaded at the same time. Loading an image from the disk is quite slow; it is actually a lot slower to load two images from the disk than to compare them
and find whether they are similar.
My question is how can I implement a method that will have the responsibility of loading/unloading the images from the disk, and yielding them in pairs, while taking advantage of all the available memory, but without exceeding the memory limit. Here is the signature of the method that I am trying to implement:
static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
    IReadOnlyList<TSource> source,
    Func<TSource, long> sizeSelector,
    Func<TSource, TItem> itemLoader,
    long maxConcurrentSize) where TItem : IDisposable;
The TSource
will be the path of the file, and the TItem
will be an ImageInfo
. I am intending to use it like this:
string[] paths = Directory.GetFiles(@"C:\Images", "*.jpg");
var pairs = GetPairs(paths,
    path => new FileInfo(path).Length,
    path => new ImageInfo() { Path = path, Image = Image.FromFile(path) },
    2_000_000_000);
foreach (var (x, y) in pairs)
{
    if (AreSimilar(x, y))
        Console.WriteLine($"{x.Path} and {y.Path} are similar!");
}
I am currently out of ideas of how to implement this method. It looks like a serious undertaking. All I have right now is the simple version below, that loads the images in pairs and ignores the sizeSelector
and maxConcurrentSize
parameters:
static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
    IReadOnlyList<TSource> source,
    Func<TSource, long> sizeSelector,
    Func<TSource, TItem> itemLoader,
    long maxConcurrentSize) where TItem : IDisposable
{
    for (int i = 0; i < source.Count; i++)
    {
        using var first = itemLoader(source[i]);
        for (int j = i + 1; j < source.Count; j++)
        {
            using var second = itemLoader(source[j]);
            yield return (first, second);
        }
    }
}
Obviously the performance is terrible, since each image is loaded ~500 times on average.
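For reference, that "~500 times on average" figure checks out. With n = 1000, the outer loop loads each image once as `first`, and the inner loop reloads image j once for every i < j. A quick sanity check of the arithmetic (in Python, since it is just counting, not part of the C# code):

```python
n = 1000

# Closed form: n loads for `first` + one load of `second` per pair.
total_loads = n + n * (n - 1) // 2

# Simulate the nested loop above to double-check the count.
loads = [0] * n
for i in range(n):
    loads[i] += 1              # outer loop loads image i as `first`
    for j in range(i + 1, n):
        loads[j] += 1          # inner loop reloads image j as `second`

assert sum(loads) == total_loads
print(total_loads, total_loads / n)  # 500500 loads in total, ~500.5 per image
```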
Answer:
Here is a solution that works for your problem, with an included unit test. Unfortunately it performs poorly when only a small number of images can be loaded at a time, resulting in at worst 2x the number of loads of your proposed solution. However, when a large number of images can be loaded at once, this algorithm begins to outperform yours, limiting towards 1 load per image as the allowable memory size increases.
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace UnitTest;
[TestClass]
public class TestComparison
{
    [TestMethod]
    public void Test()
    {
        const int imageCount = 10;
        var (totalLoads, max, pairs) = RunTest(1, 5, 1000, imageCount);
        Assert.AreEqual(10, totalLoads);
        Assert.AreEqual(1, max);

        (_, max, var pairs2) = RunTest(5, 5, 12, imageCount);
        Assert.AreEqual(9, max);

        var expectedPairs = (imageCount - 1) * imageCount / 2;
        Assert.AreEqual(expectedPairs, pairs.Count);
        Assert.AreEqual(expectedPairs, pairs2.Count);
    }
    private (int totalLoads, int maxLoads, List<(ImageInfo, ImageInfo)> pairs) RunTest(
        int minImageSize, int maxImageSize, long memorySize, int imageCount)
    {
        var images = GenerateImages(imageCount, minImageSize, maxImageSize);
        var paths = images.Keys.ToList();
        var imageLoadCounts = new Dictionary<string, int>();
        var pairs = GetPairs(paths, p => images[p].Size, LoadImage, memorySize).ToList();
        var totalLoads = imageLoadCounts.Values.Sum();
        var maxLoad = imageLoadCounts.Values.Max();
        return new(totalLoads, maxLoad, pairs);

        ImageInfo LoadImage(string path)
        {
            if (!imageLoadCounts.TryGetValue(path, out int count))
            {
                count = 0;
            }
            count++;
            imageLoadCounts[path] = count;
            return images[path];
        }
    }
    private Dictionary<string, ImageInfo> GenerateImages(int imageCount, int minImageSize, int maxImageSize)
    {
        var images = new Dictionary<string, ImageInfo>();
        for (int i = 0; i < imageCount; i++)
        {
            images[RandomString()] = new() { Value = _random.NextSingle(), Size = _random.Next(minImageSize, maxImageSize) };
        }
        return images;
    }
    class ImageInfo : IDisposable
    {
        public int Size { get; set; }
        public float Value { get; set; }
        public void Dispose()
        {
        }
    }
    private static readonly Random _random = new();

    static string RandomString()
    {
        const int length = 10;
        var str = string.Empty;
        for (int i = 0; i < length; i++)
        {
            str += (char)_random.Next(65, 90);
        }
        return str;
    }
    static bool AreSimilar(ImageInfo x, ImageInfo y) => Math.Abs(y.Value - x.Value) < 0.1f;
    record Comparison<T>(T Path1, T Path2)
    {
        public bool Contains(T path) => Path1.Equals(path) || Path2.Equals(path);

        public T Other(T path)
        {
            if (Path1.Equals(path)) return Path2;
            if (Path2.Equals(path)) return Path1;
            throw new Exception("invalid path");
        }

        public bool IsEquivalentTo(Comparison<T> other) =>
            (other.Path1.Equals(Path1) && other.Path2.Equals(Path2)) ||
            (other.Path2.Equals(Path1) && other.Path1.Equals(Path2));
    }
    static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
        IReadOnlyList<TSource> source,
        Func<TSource, long> sizeSelector,
        Func<TSource, TItem> itemLoader,
        long maxConcurrentSize) where TItem : IDisposable
    {
        var itemCount = source.Count;
        var comparisons = new List<Comparison<TSource>>(itemCount * itemCount / 2);
        for (int i = 0; i < itemCount - 1; i++)
        {
            for (int j = i + 1; j < itemCount; j++)
            {
                comparisons.Add(new(source[i], source[j]));
            }
        }
        return GetPairs(comparisons, sizeSelector, itemLoader, maxConcurrentSize);
    }
    static IEnumerable<(TItem, TItem)> GetPairs<TSource, TItem>(
        List<Comparison<TSource>> remainingComparisons,
        Func<TSource, long> sizeSelector,
        Func<TSource, TItem> itemLoader,
        long maxConcurrentSize) where TItem : IDisposable
    {
        if (!remainingComparisons.Any()) yield break;
        // Load as many images as possible from the remaining comparisons.
        var images = LoadImages(remainingComparisons, sizeSelector, itemLoader, maxConcurrentSize);
        var imageCount = images.Count;
        for (int i = 0; i < imageCount - 1; i++)
        {
            var (path1, image1) = images[i];
            for (int j = i + 1; j < imageCount; j++)
            {
                var (path2, image2) = images[j];
                yield return new(image1, image2);
                var comparison = new Comparison<TSource>(path1, path2);
                remainingComparisons.RemoveAll(c => c.IsEquivalentTo(comparison));
            }
        }
        // Dispose the loaded images before recursing.
        foreach (var image in images.Select(i => i.Image))
        {
            image.Dispose();
        }
        images = null; // allow GC
        foreach (var pair in GetPairs(remainingComparisons, sizeSelector, itemLoader, maxConcurrentSize))
        {
            yield return pair;
        }
    }
    /// <summary>
    /// Loads as many images into memory as possible from the remaining comparison list.
    /// </summary>
    static List<(TSource Path, TItem Image)> LoadImages<TSource, TItem>(
        List<Comparison<TSource>> remainingComparisons,
        Func<TSource, long> sizeSelector,
        Func<TSource, TItem> itemLoader,
        long maxConcurrentSize)
    {
        var availableMemory = maxConcurrentSize;
        remainingComparisons = remainingComparisons.ToList(); // copy the list so we can alter it in local scope
        var loadedImages = new List<(TSource Path, TItem Image)>();
        if (!TryGetSeed(out var seed)) throw new Exception("One of the images is too large to load into memory");
        while (remainingComparisons.Any())
        {
            if (!remainingComparisons.TryGetFirst(c => c.Contains(seed), out var comparison))
            {
                if (!TryGetSeed(out seed)) break;
                continue;
            }
            var other = comparison.Other(seed);
            remainingComparisons.Remove(comparison);
            if (!TryLoad(other)) break;
        }
        return loadedImages;

        bool TryLoad(TSource path)
        {
            var fileSize = sizeSelector(path);
            if (fileSize > availableMemory) return false;
            loadedImages.Add(new(path, itemLoader(path)));
            availableMemory -= fileSize;
            return true;
        }

        bool TryGetSeed(out TSource seed)
        {
            // First, remove any comparisons that are already covered by the loaded list.
            var loadedImageCount = loadedImages.Count;
            for (int i = 0; i < loadedImageCount - 1; i++)
            {
                for (int j = i + 1; j < loadedImageCount; j++)
                {
                    var localComparison = new Comparison<TSource>(loadedImages[i].Path, loadedImages[j].Path);
                    remainingComparisons.RemoveAll(c => c.IsEquivalentTo(localComparison));
                }
            }
            if (!remainingComparisons.TryGetFirst(out var firstRemaining))
            {
                seed = default;
                return false;
            }
            seed = firstRemaining.Path1;
            return TryLoad(seed);
        }
    }
}
public static class Extensions
{
    public static bool TryGetFirst<T>(this IEnumerable<T> items, out T value) =>
        items.TryGetFirst(t => true, out value);

    public static bool TryGetFirst<T>(this IEnumerable<T> items, Predicate<T> condition, out T value)
    {
        foreach (var item in items)
        {
            if (condition(item))
            {
                value = item;
                return true;
            }
        }
        value = default;
        return false;
    }
}
For the sake of answering your question, time complexity was ignored. The current implementation of TryGetSeed makes the time complexity O(N^3), but this can easily be improved. I suspect the same algorithm could be written in O(N^2) time, which is the best possible time complexity for this problem.
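To put some numbers behind "limiting towards 1 load per image", here is a rough model (in Python, since it is just counting) of a simpler blocked schedule: half of memory holds a resident block of images while every later block is streamed once through the other half. This is a hypothetical back-of-envelope scheme, not the greedy algorithm above, but it shows the same trend: as the number of images that fit concurrently (k) grows, loads per image fall from ~n/2 toward 1.

```python
import math

def blocked_load_count(n: int, k: int) -> int:
    """Total image loads for a simplified blocked schedule: blocks of
    k//2 images; block i stays resident while each later block is
    streamed once through the other half of memory. (An assumed model,
    not the exact algorithm in the answer above.)"""
    half = max(1, k // 2)
    blocks = math.ceil(n / half)
    sizes = [min(half, n - b * half) for b in range(blocks)]
    loads = 0
    for i in range(blocks):
        loads += sizes[i]            # block i loaded once as the resident half
        loads += sum(sizes[i + 1:])  # later blocks streamed past it once each
    return loads

n = 1000
print(blocked_load_count(n, 2) / n)      # k=2 degenerates to the naive pairing: ~500.5 loads/image
print(blocked_load_count(n, 80) / n)     # ~80 images resident at once: 13.0 loads/image
print(blocked_load_count(n, 2 * n) / n)  # everything fits in memory: 1.0 load/image
```

The same scaling applies to the algorithm above: the more comparisons each memory-full of images can satisfy, the fewer times any single image has to be reloaded.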