I'm trying to download approx. 45,000 image files from an API. Each image file is less than 50 KB. With my code this takes 2-3 hours.
Is there a more efficient way in C# to download them?
private static readonly string baseUrl = "http://url.com/Handlers/Image.ashx?imageid={0}&type=image";

internal static void DownloadAllMissingPictures(List<ListObject> ImagesToDownload, string imageFolderPath)
{
    Parallel.ForEach(Partitioner.Create(0, ImagesToDownload.Count), range =>
    {
        for (var i = range.Item1; i < range.Item2; i++)
        {
            string ImageID = ImagesToDownload[i].ImageId;
            using (var webClient = new WebClient())
            {
                string url = String.Format(baseUrl, ImageID);
                string file = String.Format(@"{0}\{1}.jpg", imageFolderPath, ImagesToDownload[i].ImageId);
                byte[] data = webClient.DownloadData(url);
                using (MemoryStream mem = new MemoryStream(data))
                {
                    using (var image = Image.FromStream(mem))
                    {
                        image.Save(file, ImageFormat.Jpeg);
                    }
                }
            }
        }
    });
}
CodePudding user response:
Likely not.
One thing to think about, though, besides not using WebClient: it was replaced by HttpClient a long time ago, so I suggest a quick run through the documentation.
Regardless of what you think you are doing with Parallel.ForEach, you are limited by the parallel connection settings (ServicePointManager, HttpClientHandler).
You should read the manuals for those and experiment with higher limits, because right now they are quite likely limiting your parallelism to a rather low number, and you can possibly handle 3-4 times that limit.
has a deeper explanation.
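As a rough illustration of the connection settings mentioned above, here is a minimal sketch that raises the per-server connection limit on an HttpClientHandler. The class name ConnectionLimitSketch, its method names, and the value 20 used in the usage note are assumptions made for the example, not something from the question. Which limit actually applies depends on the runtime, so treat the number as a starting point for experiments.

```csharp
using System.Net.Http;

// Hypothetical helper, for illustration only.
public static class ConnectionLimitSketch
{
    // Builds a handler that allows up to maxConnectionsPerServer
    // simultaneous connections to the same host.
    // On .NET Framework the comparable global setting is
    // ServicePointManager.DefaultConnectionLimit.
    public static HttpClientHandler CreateHandler(int maxConnectionsPerServer)
    {
        return new HttpClientHandler
        {
            MaxConnectionsPerServer = maxConnectionsPerServer,
        };
    }

    // Creates a client that owns (and later disposes) the handler.
    // The client is meant to be created once and reused for all downloads.
    public static HttpClient CreateClient(int maxConnectionsPerServer)
    {
        return new HttpClient(CreateHandler(maxConnectionsPerServer), disposeHandler: true);
    }
}
```

A call such as ConnectionLimitSketch.CreateClient(20) would then give a single shared client to use for every request.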
CodePudding user response:
The Parallel.ForEach method is not well suited for I/O-bound operations, because it requires a thread for each parallel workflow, and threads are not cheap resources. You can make it work by increasing the number of threads that the ThreadPool creates immediately on demand, with the SetMinThreads method, but that's not as efficient as using asynchronous programming and async/await. With asynchronous programming a thread is not required while the file is downloaded, or while the file is saved to disk, so it is possible to download dozens of files concurrently using only a handful of threads.
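For completeness, the SetMinThreads workaround could look roughly like the sketch below. The class name ThreadPoolSketch and the method RaiseMinWorkerThreads are made up for this example, and any concrete thread count you pass in is an assumption to tune, not a recommendation.

```csharp
using System.Threading;

// Hypothetical helper, for illustration only.
public static class ThreadPoolSketch
{
    // Raises the minimum number of worker threads to desiredWorkers,
    // leaving the I/O completion thread minimum unchanged.
    // Returns true if the pool already met or accepted the new minimum.
    public static bool RaiseMinWorkerThreads(int desiredWorkers)
    {
        ThreadPool.GetMinThreads(out int workers, out int ioThreads);
        if (desiredWorkers <= workers)
        {
            return true; // the current minimum is already high enough
        }
        return ThreadPool.SetMinThreads(desiredWorkers, ioThreads);
    }
}
```

You would call it once at startup, for example ThreadPoolSketch.RaiseMinWorkerThreads(32), before starting the Parallel.ForEach loop. This only makes blocked threads available sooner; each download still holds one thread while it waits.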
Using the Partitioner for creating ranges is a useful technique when parallelizing extremely granular (lightweight) workloads, like adding or comparing numbers. In your case the workload is quite coarse (chunky), so using ranges is more likely to slow things down than speed them up. Using ranges prevents balancing the workload, in case some files take longer to download than others.
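To see what range partitioning does, the small sketch below lists the ranges that Partitioner.Create produces for a given item count and range size. The class name RangeSketch and the numbers in the usage note are assumptions chosen for the demonstration. Each range is handed to one worker as a unit, which is why a single slow download delays every other item in the same range.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class RangeSketch
{
    // Returns the [from, to) index ranges that a range partitioner
    // would hand out for `count` items in chunks of `rangeSize`.
    public static List<Tuple<int, int>> GetRanges(int count, int rangeSize)
    {
        return Partitioner.Create(0, count, rangeSize)
            .GetDynamicPartitions()
            .ToList();
    }
}
```

For example, RangeSketch.GetRanges(10, 4) yields three ranges that together cover indexes 0 through 9, and the items inside each range are processed sequentially by whichever thread picks that range up.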
My suggestion is to use the Parallel.ForEachAsync method (introduced in .NET 6), which is designed specifically for parallelizing asynchronous I/O operations. Here is how you can use this method in order to download the files in parallel, with a specific degree of parallelism, and cancellation support:
// Namespaces needed: System, System.Collections.Generic, System.IO,
// System.Net.Http, System.Threading, System.Threading.Tasks

private static readonly string _baseUrlPattern =
    "http://url.com/Handlers/Image.ashx?imageid={0}&type=image";

// A single HttpClient instance is reused for all requests.
private static readonly HttpClient _httpClient = new HttpClient();

internal static void DownloadAllMissingPictures(
    IEnumerable<ListObject> imagesToDownload, string imageFolderPath,
    CancellationToken cancellationToken = default)
{
    var parallelOptions = new ParallelOptions()
    {
        MaxDegreeOfParallelism = 10, // at most 10 downloads in flight
        CancellationToken = cancellationToken,
    };
    Parallel.ForEachAsync(imagesToDownload, parallelOptions, async (image, ct) =>
    {
        string imageId = image.ImageId;
        string url = String.Format(_baseUrlPattern, imageId);
        // The ".jpg" extension matches the file names used in the question.
        string filePath = Path.Combine(imageFolderPath, imageId + ".jpg");
        using HttpResponseMessage response = await _httpClient.GetAsync(url, ct);
        response.EnsureSuccessStatusCode();
        // Write the response body straight to disk, without decoding the image.
        using FileStream fileStream = File.OpenWrite(filePath);
        await response.Content.CopyToAsync(fileStream, ct);
    }).Wait(); // blocks the calling thread until all downloads complete
}
The Parallel.ForEachAsync method returns a Task. It's recommended that Tasks are awaited, but taking into account that you are probably not familiar with asynchronous programming yet, let's just Wait it instead for the time being.
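Once you are comfortable with async/await, the awaited variant could look roughly like this sketch. The ListObject class below is a hypothetical stand-in for the type in the question, the ".jpg" extension is carried over from the question's file names, and File.Create is my own choice so that an existing file is overwritten from scratch. Everything else follows the approach described above.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the ListObject type from the question.
public sealed class ListObject
{
    public string ImageId { get; set; } = "";
}

public static class ImageDownloader
{
    private const string BaseUrlPattern =
        "http://url.com/Handlers/Image.ashx?imageid={0}&type=image";

    // One shared client for all requests.
    private static readonly HttpClient _httpClient = new HttpClient();

    // Async counterpart of DownloadAllMissingPictures: returns a Task
    // that the caller awaits instead of blocking a thread with Wait().
    public static async Task DownloadAllMissingPicturesAsync(
        IEnumerable<ListObject> imagesToDownload, string imageFolderPath,
        CancellationToken cancellationToken = default)
    {
        var parallelOptions = new ParallelOptions
        {
            MaxDegreeOfParallelism = 10, // assumption: tune by experiment
            CancellationToken = cancellationToken,
        };

        await Parallel.ForEachAsync(imagesToDownload, parallelOptions, async (image, ct) =>
        {
            string url = string.Format(BaseUrlPattern, image.ImageId);
            string filePath = Path.Combine(imageFolderPath, image.ImageId + ".jpg");

            using HttpResponseMessage response = await _httpClient.GetAsync(url, ct);
            response.EnsureSuccessStatusCode();

            // File.Create truncates any existing file before writing.
            await using FileStream fileStream = File.Create(filePath);
            await response.Content.CopyToAsync(fileStream, ct);
        });
    }
}
```

The caller then writes await ImageDownloader.DownloadAllMissingPicturesAsync(images, folder); from an async method, which means the calling method and its callers need to become async as well.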
In case the implementation above does not improve the performance of the whole procedure, you could experiment with the MaxDegreeOfParallelism configuration, and also with the settings mentioned in this question: How to increase the outgoing HTTP requests quota in .NET Core?