Trying to download a huge amount of files more efficiently in C#

Time:11-21

I'm trying to download approx. 45,000 image files from an API. Each image file is less than 50 KB. With my code this takes 2-3 hours.

Is there a more efficient way in C# to download them?

private static readonly string baseUrl = "http://url.com/Handlers/Image.ashx?imageid={0}&type=image";
    internal static void DownloadAllMissingPictures(List<ListObject> ImagesToDownload, string imageFolderPath)
    {
        Parallel.ForEach(Partitioner.Create(0, ImagesToDownload.Count), range =>
        {
            for (var i = range.Item1; i < range.Item2; i++)
            {
                string ImageID = ImagesToDownload[i].ImageId;

                using (var webClient = new WebClient())
                {
                    string url = String.Format(baseUrl, ImageID);
                    string file = String.Format(@"{0}\{1}.jpg", imageFolderPath, ImagesToDownload[i].ImageId);

                    byte[] data = webClient.DownloadData(url);

                    using (MemoryStream mem = new MemoryStream(data))
                    {
                        using (var image = Image.FromStream(mem))
                        {
                            image.Save(file, ImageFormat.Jpeg);
                        }
                    }                    
                }
            }
        });
    }

CodePudding user response:

Likely not.

One thing to think about though: WebClient was superseded by HttpClient a long time ago, so don't use it — you likely just missed the memo. I suggest a quick run through the documentation.
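As a minimal sketch of what that migration looks like, here is the question's WebClient.DownloadData call rewritten with HttpClient (a single shared instance should be reused for all requests; the URL pattern is the one from the question):

```csharp
using System;
using System.Net.Http;

class ImageDownloader
{
    // One shared HttpClient for the whole application, instead of
    // a new WebClient per download.
    private static readonly HttpClient _httpClient = new HttpClient();

    internal static byte[] DownloadImage(string imageId)
    {
        string url = String.Format(
            "http://url.com/Handlers/Image.ashx?imageid={0}&type=image", imageId);
        // GetByteArrayAsync replaces WebClient.DownloadData; blocking on it
        // here only to keep the sketch synchronous like the original code.
        return _httpClient.GetByteArrayAsync(url).GetAwaiter().GetResult();
    }
}
```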

Regardless of what you think Parallel.ForEach is doing for you, you are limited by the parallel connection settings (ServicePointManager, HttpClientHandler).

You should read the documentation for those and experiment with higher limits, because right now they are quite likely capping your parallelism at a low number, and the server can possibly handle 3-4 times that limit.

A deeper explanation can be found here: https://coderedirect.com/questions/230117/maximum-concurrent-requests-for-webclient-httpwebrequest-and-httpclient
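As a sketch of the settings mentioned above (the limit value of 50 is just an example to experiment with, not a recommendation):

```csharp
using System.Net;
using System.Net.Http;

// .NET Framework: WebClient/HttpWebRequest honor ServicePointManager.
// The classic default is only 2 concurrent connections per host.
ServicePointManager.DefaultConnectionLimit = 50;

// .NET Core / .NET 5+: HttpClient is configured through its handler instead.
var handler = new HttpClientHandler
{
    MaxConnectionsPerServer = 50
};
var httpClient = new HttpClient(handler);
```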

CodePudding user response:

The Parallel.ForEach method is not well suited for I/O-bound operations, because it requires a thread for each parallel workflow, and threads are not cheap resources. You can make it work by increasing the number of threads that the ThreadPool creates immediately on demand, with the SetMinThreads method, but that's not as efficient as using asynchronous programming with async/await. With asynchronous programming no thread is required while a file is being downloaded or saved to disk, so it is possible to download dozens of files concurrently using only a handful of threads.
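For completeness, the SetMinThreads workaround mentioned above looks like this (the value 100 is an illustrative assumption; async/await remains the better option):

```csharp
using System.Threading;

// Read the current minimums so we only change the worker-thread count.
ThreadPool.GetMinThreads(out int workerThreads, out int completionPortThreads);

// Raise the minimum so the pool injects new threads immediately on demand,
// instead of throttling thread creation while Parallel.ForEach blocks on I/O.
ThreadPool.SetMinThreads(100, completionPortThreads);
```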

Using the Partitioner for creating ranges is a useful technique when parallelizing extremely granular (lightweight) workloads, like adding or comparing numbers. In your case the workload is quite coarse (chunky), so using ranges is more likely to slow things down than speed them up. Using ranges prevents balancing the workload, in case some files take longer to download than others.

My suggestion is to use the Parallel.ForEachAsync method (introduced in .NET 6), which is designed specifically for parallelizing asynchronous I/O operations. Here is how you can use this method in order to download the files in parallel, with a specific degree of parallelism, and cancellation support:

private static readonly string _baseUrlPattern =
    "http://url.com/Handlers/Image.ashx?imageid={0}&type=image";

private static readonly HttpClient _httpClient = new HttpClient();

internal static void DownloadAllMissingPictures(
    IEnumerable<ListObject> imagesToDownload, string imageFolderPath,
    CancellationToken cancellationToken = default)
{
    var parallelOptions = new ParallelOptions()
    {
        MaxDegreeOfParallelism = 10,
        CancellationToken = cancellationToken,
    };
    Parallel.ForEachAsync(imagesToDownload, parallelOptions, async (image, ct) =>
    {
        string imageId = image.ImageId;
        string url = String.Format(_baseUrlPattern, imageId);
        string filePath = Path.Combine(imageFolderPath, imageId + ".jpg");
        using HttpResponseMessage response = await _httpClient.GetAsync(url, ct);
        response.EnsureSuccessStatusCode();
        using FileStream fileStream = File.OpenWrite(filePath);
        await response.Content.CopyToAsync(fileStream);
    }).Wait();
}

The Parallel.ForEachAsync method returns a Task. It's recommended that Tasks are awaited, but taking into account that you are probably not familiar with asynchronous programming yet, let's just Wait it instead for the time being.
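Once you are comfortable with async/await, the method can be exposed as async so callers await it instead of blocking. This sketch mirrors the code above (same assumed `_baseUrlPattern`, `_httpClient`, and `ListObject` type):

```csharp
internal static async Task DownloadAllMissingPicturesAsync(
    IEnumerable<ListObject> imagesToDownload, string imageFolderPath,
    CancellationToken cancellationToken = default)
{
    var parallelOptions = new ParallelOptions()
    {
        MaxDegreeOfParallelism = 10,
        CancellationToken = cancellationToken,
    };
    // Awaiting instead of Wait() keeps the calling thread free
    // and unwraps exceptions directly rather than in an AggregateException.
    await Parallel.ForEachAsync(imagesToDownload, parallelOptions, async (image, ct) =>
    {
        string url = String.Format(_baseUrlPattern, image.ImageId);
        string filePath = Path.Combine(imageFolderPath, image.ImageId + ".jpg");
        using HttpResponseMessage response = await _httpClient.GetAsync(url, ct);
        response.EnsureSuccessStatusCode();
        using FileStream fileStream = File.OpenWrite(filePath);
        await response.Content.CopyToAsync(fileStream, ct);
    });
}
```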

In case the implementation above does not improve the performance of the whole procedure, you could experiment with the MaxDegreeOfParallelism configuration, and also with the settings mentioned in this question: How to increase the outgoing HTTP requests quota in .NET Core?
