Home > OS >  Inconsistent outcome when trying to fetch all items from a paginated url
Inconsistent outcome when trying to fetch all items from a paginated url

Time:07-17

I am having the issue of fetching all forms currently hosted in hubspot.

I tried with a simple for loop where I made one request at the time, and fetched one form at a time, which worked, but was very slow.

I then thought it might work better if I created a seperate task for each request, and then made the task create each request, and store them in one common list.

Problem is that I expect the list to have 2000 items, but I never seem to get that, it seem pretty inconsistent the number of items I get?

But how come?

This is how I have setup my for fetching scheme.

private static async Task<IEnumerable<HubspotModel>> GetForms(
    string hubspotPath, int pageSize)
{
    int totalResults;
    int offset = 0;
    List<HubspotModel> output = new();
    List<Task> tasks = new();
    using var client = new HttpClient();
    {
        System.Net.Http.Headers.HttpResponseHeaders requestHeader = client
            .GetAsync($"https://api.hubapi.com{hubspotPath}?"  
                $"hapikey={HubspotConfiguration.ApiKey}&limit={1}&offset={0}")
            .Result.Headers;
        totalResults = int.Parse(requestHeader.GetValues("x-total-count").First());
        do
        {
            tasks.Add(Task.Run(() =>
            {
                int scopedOffset = offset;
                IEnumerable<HubspotModel> forms = GetFormsFromHubspot(hubspotPath,
                    pageSize, offset, client);
                output.AddRange(forms);

            }).ContinueWith(requestReponse =>
            {
                if (requestReponse.IsFaulted)
                {
                    Console.WriteLine("it failed");
                }
            }));
            offset  = pageSize;
        }
        while (totalResults > offset);

        await Task.WhenAll(tasks);
    }

    return output;
}

private static IEnumerable<HubspotModel> GetFormsFromHubspot(string hubspotPath,
    int pageSize, int offset, HttpClient client)
{
    HttpResponseMessage request = client
        .GetAsync($"https://api.hubapi.com{hubspotPath}?"  
            $"hapikey={HubspotConfiguration.ApiKey}&limit={pageSize}&offset={offset}")
        .Result;
    request.EnsureSuccessStatusCode();
    string content = request.Content.ReadAsStringAsync().Result;
    IEnumerable<Dictionary<string, object>> jsonResponse = JsonSerializer
        .Deserialize<IEnumerable<Dictionary<string, object>>>(content,
        new JsonSerializerOptions() { });
    var guid = Guid.Parse(jsonResponse.First()["guid"].ToString());
    var forms = jsonResponse.Select(x => new HubspotModel()
    {
        id = Guid.Parse(x["guid"].ToString()),
        FormName = x["name"].ToString(),
        Form = x
    });
    return forms;
}

CodePudding user response:

First of all, I'd suggest to make GetFormsFromHotspot an async as well and use await client.GetAsync( ...) and await request.Content.ReadAsStringAsync() instead of client.GetAsync(...).Result and ReadAsStringAsync().Result respectively, because using .Result will block the current thread and thus, you will throw away the advantages of async Tasks.

But the main cause of the problem should be the following

GetFormsFromHubspot(hubspotPath, pageSize, offset, client);

Here you are calling the GetFormsFromHubspot with an offset parameter from an outer scope (and that value keeps changing), thus it will not use the value it had when you created that task but it uses the value it actually has, when that particular part of the code is really executed. So the value that is used as an offset is quite random. You already tried to create a

int scopedOffset = offset;

but you don't use it. And also you create it at the wrong position. Create that scopedOffset outside of the task, but inside the loop's body. So it will be created at the creationtime of the task. And because it's inside the loop's body, a new value will be created for each task.

The following should do the trick (after you refactor GetFormsFromHubspot to be async.

do {
  int scopedOffset = offset
  tasks.Add(Task.Run(async () => {
      IEnumerable<HubspotModel> forms = await GetFormsFromHubspot(hubspotPath, pageSize, scopedOffset, client);
      output.AddRange(forms);
    })     
    .ContinueWith(...);
  );
  offset  = pageSize;
} while (totalResults > offset);

CodePudding user response:

The main problem with your code is that the List<T> is not thread-safe. When multiple threads are Adding to a list concurrently without synchronization, its behavior becomes undefined (throws exceptions, becomes corrupted etc). There are many ways to solve this problem:

  1. Synchronize the access to the list with the lock statement: lock (output) output.AddRange(forms);.
  2. Use a concurrent collection instead of the List<T>, for example a ConcurrentQueue<T>.
  3. Avoid collecting manually the output altogether. Instead of storing your tasks in a List<Task>, you can store them in a List<Task<HubspotModel[]>>, meaning that each task will be a generic Task<TResult>, with the TResult being an array of HubspotModel instances. Finally you will get all the output at once, when you await the Task.WhenAll.

Below is an implementation of the third idea. Notice that I have avoided creating a HttpClient instance, because the recommendation is to instantiated this class only once, and reuse it throughout the life of the application.

private static async Task<HubspotModel[]> GetFormsAsync(HttpClient client,
    string hubspotPath, int pageSize)
{
    string url = $"https://api.hubapi.com{hubspotPath}?hapikey="  
        $"{HubspotConfiguration.ApiKey}&limit={1}&offset={0}";
    HttpResponseMessage response = await client.GetAsync(url)
        .ConfigureAwait(false);
    response.EnsureSuccessStatusCode();
    int totalCount = Int32.Parse(response.Headers
        .GetValues("x-total-count").First());

    List<int> offsets = new();
    for (int offset = 0; offset < totalCount; offset  = pageSize)
        offsets.Add(offset);

    Task<HubspotModel[]>[] tasks = offsets.Select(offset => Task.Run(async () =>
    {
        HubspotModel[] forms = await GetFormsAsync(client,
            hubspotPath, pageSize, offset).ConfigureAwait(false);
        return forms;
    })).ToArray();

    HubspotModel[][] results = await Task.WhenAll(tasks).ConfigureAwait(false);

    return results.SelectMany(x => x).ToArray();
}

private async static Task<HubspotModel[]> GetFormsAsync(HttpClient client,
    string hubspotPath, int pageSize, int offset)
{
    string url = $"https://api.hubapi.com{hubspotPath}?hapikey="  
        $"{HubspotConfiguration.ApiKey}&limit={pageSize}&offset={offset}";
    HttpResponseMessage response = await client.GetAsync(url)
        .ConfigureAwait(false);
    response.EnsureSuccessStatusCode();

    string content = await response.Content.ReadAsStringAsync()
        .ConfigureAwait(false);
    IEnumerable<Dictionary<string, object>> jsonResponse = JsonSerializer
        .Deserialize<IEnumerable<Dictionary<string, object>>>(content,
        new JsonSerializerOptions() { });

    Guid guid = Guid.Parse(jsonResponse.First()["guid"].ToString());
    HubspotModel[] forms = jsonResponse.Select(x => new HubspotModel()
    {
        Id = Guid.Parse(x["guid"].ToString()),
        FormName = x["name"].ToString(),
        Form = x
    }).ToArray();
    return forms;
}

One more improvement that you could consider doing is to switch from the Task.WhenAll to the new (.NET 6) API Parallel.ForEachAsync. The advantage is that you'll get control over the degree of parallelism, and so you'll be able to reduce the parallelization in case the remote server can't keep up with the pressure. Unfortunatelly the Parallel.ForEachAsync method does not return the results like the Task.WhenAll, so you'll be back to your original problem. You can find a solution about this here: ForEachAsync with Result.

  • Related