I want to learn a bit of web scraping, and I was expecting these two code blocks to return the same HTML (as a string).
var result = await client.GetAsync("https://www.footlocker.at/de/product/nike-dunk-high-damenschuhe/315347028102.html");
var stream = await result.Content.ReadAsStringAsync();
Console.WriteLine(stream);
var result2 = await client.GetStreamAsync("https://www.footlocker.at/de/product/nike-dunk-high-damenschuhe/315347028102.html");
var stream2 = await new StreamReader(result2).ReadToEndAsync();
Console.WriteLine(stream2);
In this example, GetStreamAsync() returns the full HTML of the page, while GetAsync() also returns HTML, but it is just an error page. Why is that? Am I misunderstanding the difference between these methods, or am I doing something completely wrong?
I looked for answers on Stack Overflow and read through the documentation, but unfortunately I didn't find a specific answer for this case.
Answer:
As explained by @RichardDeeming, HttpClient.GetStreamAsync calls HttpClient.GetAsync with HttpCompletionOption.ResponseHeadersRead.
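To make that relationship concrete, here is a minimal sketch of the two read paths side by side. It is only an approximation of what GetStreamAsync does, not the library's actual source. StubHandler, Demo, ReadBufferedAsync, ReadStreamedAsync and the example.invalid URL are names made up for this illustration; the stub replaces a real server so the sketch runs offline.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a real server so the sketch runs offline.
sealed class StubHandler : HttpMessageHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("<html>hello</html>")
        };
        return Task.FromResult(response);
    }
}

static class Demo
{
    // Default GetAsync: the body is buffered before the task completes.
    public static async Task<string> ReadBufferedAsync(HttpClient client, string url)
    {
        using var response = await client.GetAsync(url);
        // GetAsync does not throw on 4xx/5xx, so print the status to spot error pages.
        Console.WriteLine($"Status: {(int)response.StatusCode}");
        return await response.Content.ReadAsStringAsync();
    }

    // Roughly the GetStreamAsync path: complete on headers, then stream the body.
    public static async Task<string> ReadStreamedAsync(HttpClient client, string url)
    {
        using var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
        response.EnsureSuccessStatusCode(); // throws HttpRequestException on 4xx/5xx
        using var stream = await response.Content.ReadAsStreamAsync();
        using var reader = new StreamReader(stream);
        return await reader.ReadToEndAsync();
    }

    public static async Task Main()
    {
        using var client = new HttpClient(new StubHandler());
        var url = "https://example.invalid/page"; // placeholder URL, never contacted
        var buffered = await ReadBufferedAsync(client, url);
        var streamed = await ReadStreamedAsync(client, url);
        Console.WriteLine(buffered == streamed);
    }
}
```

Against a real site, printing the status code in the buffered path is a quick way to check whether the HTML you got back is an error page, since GetAsync returns the response instead of throwing.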
The following test issues 15 concurrent downloads and shows the effect of each completion option:
async Task Main()
{
    HttpClient httpclient = new HttpClient();
    var imageUrl = "https://tenlives.com.au/wp-content/uploads/2020/09/Found-Kitten-0-8-Weeks-Busy-scaled.jpg";
    var downloadTasks = Enumerable.Range(0, 15)
        .Select(async u =>
        {
            try
            {
                // Option 1) - this will fail
                var response = await httpclient.GetAsync(imageUrl, HttpCompletionOption.ResponseHeadersRead);
                // End Option 1)
                // Option 2) - this will succeed
                //var response = await httpclient.GetAsync(imageUrl, HttpCompletionOption.ResponseContentRead);
                // End Option 2)
                response.EnsureSuccessStatusCode();
                var stream = await response.Content.ReadAsStreamAsync();
                return stream;
            }
            catch (Exception e)
            {
                Console.WriteLine($"Error downloading image: {e.Message}");
                throw;
            }
        }).ToList();
    try
    {
        await Task.WhenAll(downloadTasks);
    }
    catch (Exception e)
    {
        Console.WriteLine("================ Failed to download one or more image " + e.Message);
    }
    Console.WriteLine($"Successful downloads: {downloadTasks.Where(t => t.Status == TaskStatus.RanToCompletion).Count()}");
}
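With HttpClient.GetAsync (or SendAsync with HttpCompletionOption.ResponseContentRead), the whole response body is read from the network and copied into a local buffer before the call completes. With HttpCompletionOption.ResponseHeadersRead, the call completes as soon as the headers arrive, and the body stays on the connection until the content stream is read or disposed.
I'm not sure exactly which buffer is involved, but I think a buffer somewhere (network card, OS, HttpClient, ???) fills up when the bodies are left unread, and that blocks new responses.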
You can fix the code by managing this buffer properly, for example by disposing the associated streams:
var downloadTasks = Enumerable.Range(0, 15)
    .Select(async u =>
    {
        try
        {
            var stream = await httpclient.GetStreamAsync(imageUrl);
            // Free the buffer. Note that the returned stream can no longer be read after this.
            stream.Dispose();
            return stream;
        }
        catch (Exception e)
        {
            Console.WriteLine($"Error downloading image: {e.Message}");
            throw;
        }
    }).ToList();
Comparing these two examples should help you see the difference.
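If you also need the downloaded bytes, the stream has to be read before it is disposed. The following is a minimal sketch of one way to do that, assuming the whole body fits in memory: copy it into a MemoryStream and let `using` dispose the network stream. StubHandler, DownloadSketch, DownloadAsync and the example.invalid URL are hypothetical names for this illustration, and the stub replaces the real image server so the sketch runs offline.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the image server so the sketch runs offline.
sealed class StubHandler : HttpMessageHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new ByteArrayContent(new byte[] { 1, 2, 3 })
        };
        return Task.FromResult(response);
    }
}

static class DownloadSketch
{
    // Reads the whole body into memory, then disposes the network stream.
    public static async Task<byte[]> DownloadAsync(HttpClient client, string url)
    {
        using var stream = await client.GetStreamAsync(url);
        using var buffer = new MemoryStream();
        await stream.CopyToAsync(buffer);
        // The returned array stays usable after both streams are disposed.
        return buffer.ToArray();
    }

    public static async Task Main()
    {
        using var client = new HttpClient(new StubHandler());
        var url = "https://example.invalid/image.jpg"; // placeholder URL, never contacted
        var tasks = Enumerable.Range(0, 15)
            .Select(_ => DownloadAsync(client, url))
            .ToList();
        var results = await Task.WhenAll(tasks);
        Console.WriteLine($"Successful downloads: {results.Length}");
    }
}
```

Returning a byte array rather than the stream means callers never receive an already disposed object. For very large files, writing to a FileStream instead of a MemoryStream would avoid holding everything in memory.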