Making multiple http calls within a foreach that iterates through thousands of records

I am implementing a webjob in .NET that will have to take care of importing about 25,000 users from our database into a new one, managed with IdentityServer, that is not handled by us.

This operation will be very expensive in terms of HTTP calls, because for each user I have to make these calls:

  • user creation;
  • assigning any roles to the created user (there could be more than one, which means yet another iteration);
  • assigning (I know for sure) at least two claims (another iteration);
  • changing the user's password.

I can't handle it any other way, because the owners of the instance I'm importing users onto have defined these as the steps to take for each user.

This is what I came up with; this is my "entry" point:

internal static async Task<WorkerResponse> SendUsersAsync(string token, IDictionary<int, UserToSend> usersToSend, ICollection<Roles> roles, IMapper mapper, TextWriter logger = null)
{
    string userName = string.Empty;
    try
    {
        foreach (KeyValuePair<int, UserToSend> userToSend in usersToSend)
        {
            int externalId = userToSend.Key;
            userName = userToSend.Value.UserName;
            int fK_Soggetto = userToSend.Value.FK_Soggetto;
            logger?.Write($"Sending user {userName} with (External)Id {externalId}");
            logger?.WriteLine(string.Empty);

            UserToPost userToPost = mapper.Map<UserToPost>(userToSend.Value);

            (bool isSuccess, string messageOrUserId) = await SendUserToAuthorityAsync(token, userToPost);
            if (!isSuccess)
                return new WorkerResponse(isSuccess: false, message: messageOrUserId);

            logger?.Write($"User {userName} sent.");
            logger?.WriteLine(string.Empty);

            if (userToSend.Value.ConsulenzaRoles?.Count > 0)
            {
                logger?.Write($"Appending roles for user {userName} with id {messageOrUserId}");
                logger?.WriteLine(string.Empty);
                (bool isSuccessRoles, string messageRoles) = await SendUserRolesToAuthorityAsync(
                                                                    token,
                                                                    SendingUserHelper.GetUserRolesToPost(userToSend.Value.ConsulenzaRoles, roles, messageOrUserId),
                                                                    userName,
                                                                    logger);

                if (!isSuccessRoles)
                    return new WorkerResponse(isSuccess: false, message: messageRoles);
            }

            logger?.Write($"Appending claims for user {userName} with id {messageOrUserId}");
            logger?.WriteLine(string.Empty);
            ICollection<UserClaimToPost> userClaimsToPost = SendingUserHelper.GetUserClaimsToPost(messageOrUserId, externalId, fK_Soggetto);
            (bool isSuccessClaims, string msg) = await SendUserClaimsToAuthorityAsync(token, userClaimsToPost, userName, logger);
            if (!isSuccessClaims)
                return new WorkerResponse(isSuccess: false, message: msg);
        }
    }
    catch (BusinessLogicException ex)
    {
        return new WorkerResponse(isSuccess: false, message: $"{ex.Message} {ex.InnerException?.Message}");
    }

    return new WorkerResponse(isSuccess: true, message: $"user {userName} successfully added");
}

Inside each method (send user, send roles, etc.) the structure is more or less the same; mostly using (HttpClient httpClient = new HttpClient()):

private static async Task<(bool, string)> SendUserToAuthorityAsync(string token, UserToPost userToPost, TextWriter logger = null)
{
    try
    {
        logger?.WriteLine($"Attempting to request auth to send User {userToPost.UserName}...");
        logger?.WriteLine(string.Empty);

        IdentityServerUser userResponse;
        using (HttpClient httpClient = new HttpClient())
        {
            httpClient.BaseAddress = new Uri(AdminAuthority);
            httpClient.DefaultRequestHeaders.Accept.Clear();
            httpClient.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
            httpClient.DefaultRequestHeaders.Add("Authorization", $"Bearer {token}");

            string bodyString = JsonConvert.SerializeObject(userToPost);
            byte[] buffer = Encoding.UTF8.GetBytes(bodyString);
            ByteArrayContent byteContent = new ByteArrayContent(buffer);
            byteContent.Headers.ContentType = new MediaTypeHeaderValue("application/json");

            using (HttpResponseMessage responseMessage = await httpClient.PostAsync(IdentityServerEndpoint.PostUsers, byteContent))
            {
                responseMessage.EnsureSuccessStatusCode();

                userResponse = JsonConvert.DeserializeObject<IdentityServerUser>(await responseMessage.Content.ReadAsStringAsync());
                if (userResponse?.IsOk == false)
                    return (false, $"Error deserializing user {userToPost.UserName} from content string to abstracted model");
            }
        }

        if (userResponse?.Id == default)
            return (false, $"Error deserializing user {userToPost.UserName} from content string to abstracted model");

        return (true, $"{userResponse.Id}");
    }
    catch (Exception ex)
    {
        return (false, $"Error sending user '{userToPost.UserName}'.\n{ex.Message}{ex.InnerException?.Message}");
    }
}

I was wondering if there are smarter ways to make these calls within the foreach. For example, I am not sure if the using blocks around HttpClient are safe, and whether it would be better to design a "lazy" system that sends users a little at a time instead of firing thousands of calls in quick succession. Thank you

CodePudding user response:

Maybe you can group them and send the requests batch by batch.

  • This is just a basic draft to give the idea; you should improve the grouping scheme (GetUsersFromDb and PostUserAsync below are placeholders).

    List<User> users = GetUsersFromDb(); // however the 25,000 users are loaded

    // Group by Id range; ToList() materializes each group.
    List<List<User>> allUsers = new List<List<User>>
    {
        users.Where(u => u.Id > 0 && u.Id <= 5000).ToList(),
        users.Where(u => u.Id > 5000 && u.Id <= 10000).ToList(),
        users.Where(u => u.Id > 10000 && u.Id <= 20000).ToList(),
        users.Where(u => u.Id > 20000).ToList(),
    };

    async Task<HttpResponseMessage> PostUserAsync(User user)
    {
        // send the request here
        throw new NotImplementedException();
    }

    foreach (List<User> userGroup in allUsers)
    {
        foreach (User user in userGroup)
        {
            await PostUserAsync(user); // await the task instead of firing and forgetting
            // and mark the user in the DB after the post succeeds
        }
    }

CodePudding user response:

I have several recommendations.

Use a single HttpClient instance for the whole job. HttpClient is designed to be used concurrently by multiple threads, and creating too many instances (even if you're disposing them properly) can exhaust the available sockets.
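
For example, a minimal sketch of a shared instance, reusing the AdminAuthority constant from your code:

// One HttpClient for the whole job; it is safe to use concurrently.
private static readonly HttpClient SharedClient = CreateClient();

private static HttpClient CreateClient()
{
    HttpClient client = new HttpClient { BaseAddress = new Uri(AdminAuthority) };
    client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
    return client;
}

Your send methods can then use the static field (or take the client as a parameter) instead of newing one up per call.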

Make sure you have the list of users persisted somewhere, so you're not re-querying for the whole list of users when you attempt to import them. Break the list down into groups so you can catch common issues while they have a small impact. For example, import 1 user, and have them make sure they have the permissions they're expecting. Then import 10 users and visually verify that they seem to have the permissions they need. Then import 100 users and make sure you aren't seeing any unexpected errors. Then 1000. Then the remaining users.
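
As a rough sketch of that ramp-up, where ImportBatchAsync is a hypothetical wrapper around your per-user steps:

// Import in increasing batch sizes, verifying manually between steps.
int[] batchSizes = { 1, 10, 100, 1000 };
List<KeyValuePair<int, UserToSend>> allUsers = usersToSend.ToList();

int offset = 0;
foreach (int size in batchSizes)
{
    await ImportBatchAsync(allUsers.Skip(offset).Take(size)); // hypothetical helper
    offset += size;
    // Stop here and verify the imported users before continuing.
}
await ImportBatchAsync(allUsers.Skip(offset)); // the remaining users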

Think carefully about what you really want to have happen if only one user fails to get sent. Do you really want to prevent all the other users from getting sent? One common approach is to save information about the failed users into a "dead-letter" file, where you can analyze them and easily re-run your logic for just those users after fixing the root problem.
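
A sketch of that idea against your loop; the file name and record format here are just assumptions:

List<string> deadLetters = new List<string>();

foreach (KeyValuePair<int, UserToSend> userToSend in usersToSend)
{
    UserToPost userToPost = mapper.Map<UserToPost>(userToSend.Value);
    (bool isSuccess, string messageOrUserId) = await SendUserToAuthorityAsync(token, userToPost);
    if (!isSuccess)
    {
        // Record the failure and keep going instead of returning early.
        deadLetters.Add($"{userToSend.Key};{userToSend.Value.UserName};{messageOrUserId}");
        continue;
    }
    // ...roles, claims, password...
}

if (deadLetters.Count > 0)
    File.WriteAllLines("failed-users.csv", deadLetters);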

Think carefully about what should happen if only part of the process succeeds for a given user. Ideally you would write your process to be idempotent, so you can run the whole process again and the steps that have already happened don't cause problems. For example, make sure that you only create the user if the user doesn't already exist in the source system, and ensure that assigning roles to the user has no effect if that user already has those roles.
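
A rough sketch of the creation guard; the existence-check endpoint is an assumption, so check what your authority's admin API actually exposes:

// Assumption: the admin API can be queried for an existing user by name.
// httpClient must have its BaseAddress set, as in your code.
private static async Task<bool> UserExistsAsync(HttpClient httpClient, string token, string userName)
{
    using (HttpRequestMessage request = new HttpRequestMessage(
        HttpMethod.Get, $"{IdentityServerEndpoint.PostUsers}?userName={Uri.EscapeDataString(userName)}"))
    {
        request.Headers.Add("Authorization", $"Bearer {token}");
        using (HttpResponseMessage response = await httpClient.SendAsync(request))
        {
            return response.IsSuccessStatusCode;
        }
    }
}

// In the loop: create only when missing, so re-runs are harmless.
// if (!await UserExistsAsync(httpClient, token, userToPost.UserName)) { ...create... }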

Before worrying about concurrency, think about how much of a difference it will make. If you manage to import 1000 users in 30 minutes without optimizations, then you'll probably be able to complete the remaining 24,000 in about 12 hours. If this is a one-time job, it might be worthwhile to keep things simple and just let that job run for 12 hours rather than spending hours of your own development time trying to optimize it and risking the introduction of more bugs.

If you do determine it's worth the added complexity, it's relatively simple to put your async tasks into a throttled, concurrent pipeline. Your tasks are inherently asynchronous, so don't bother with multi-threading. As you test your process on chunks of a few hundred users, you'll probably find a point of diminishing returns on your throttle size. For example, you might find that importing 10 users at a time gives you an 8x performance boost, whereas importing 20 at a time only improves your speed minimally, and 50 at a time makes the server start rejecting your requests. Don't be greedy: set the throttle at the sweet spot where you're getting great performance gains with relatively low concurrency.
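
For example, a throttled pipeline can be as simple as a SemaphoreSlim; ImportSingleUserAsync is a hypothetical method wrapping the create/roles/claims/password steps for one user:

SemaphoreSlim throttle = new SemaphoreSlim(initialCount: 10); // tune this by testing

List<Task> tasks = usersToSend.Select(async userToSend =>
{
    await throttle.WaitAsync(); // at most 10 imports in flight at once
    try
    {
        // Hypothetical: runs all four steps for a single user.
        await ImportSingleUserAsync(token, userToSend.Key, userToSend.Value);
    }
    finally
    {
        throttle.Release();
    }
}).ToList();

await Task.WhenAll(tasks);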

The stuff you're doing to translate the JSON string into a byte array before sending it seems superfluous. Why not just use StringContent instead?
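
For example, this produces the same request body in one line:

StringContent content = new StringContent(
    JsonConvert.SerializeObject(userToPost), Encoding.UTF8, "application/json");

using (HttpResponseMessage responseMessage = await httpClient.PostAsync(IdentityServerEndpoint.PostUsers, content))
{
    responseMessage.EnsureSuccessStatusCode();
    // ...
}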

CodePudding user response:

Just some higher-level advice:

  1. Push back with your service provider on only accepting one record at a time. They really ought to provide a bulk import option for this.

  2. Failing that, make sure you have effective logging and alerting for any issues.

  3. The logging should include success messages for each record that are easy to parse. This is separate from the logs for errors or failures.

  4. The logging should include an exception report log with the entire data record for any record that fails; this is separate from the error log, which might only include the main identifier of the failed record along with the error message. Something like a CSV file you can re-run all at once after you understand and fix the errors recorded in the main log.

  5. With this effective logging in place, structure the error handling so one failure won't stop your application from continuing to process remaining records.

  6. Build the process to be idempotent. That is, make sure it's safe to accidentally run the job additional times on the same set of users. Item #3 may help with this.

  7. Build the process so that, for logging purposes, an entire user either fails or succeeds, and re-running the process for a failed user that had prior partial success will catch them up to fully completed.

  8. Look for natural checkpoints where you can know you've handled everything up to that point, so that if there are issues, later runs don't have to start from the beginning. For example, maybe break up the entire set by the first letter of the Display Name and run 26 separate batches (see the sketch after this list).

  9. Don't get too caught up worrying about performance at this point. It's probably possible to write this to run within a few minutes, but if it takes a whole day to import the large initial data set, it takes a day... but then it's done, and you don't have to think about it anymore. At this point, reliability is the far greater concern.
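
A rough sketch of the checkpoint split from item #8, using UserName as a stand-in for the Display Name (and assuming names are non-empty):

// 26 batches keyed by first letter; record each letter as a completed checkpoint.
var batches = usersToSend
    .GroupBy(u => char.ToUpperInvariant(u.Value.UserName[0]))
    .OrderBy(g => g.Key);

foreach (var batch in batches)
{
    // Process this letter's users, then persist the letter as "done"
    // so a later run can skip straight past it.
}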
