Isn't findAll() iterator inefficient to assert an unprecedented case in the Database?-CodePudding

I am a newbie constructing persistence APIs with Spring Boot, so I came about this doubt. I don't want the user to be able to register two different Accounts with the same Email. But I also don't want to use Email as primary key in the repo. So, I think the logic is quite self-explanatory.

private static boolean isRegisteredEmail(String email, AccountRepository accountRepository) {
    for ( Account account : accountRepository.findAll() ) {
        if ( account.getEmail().equals(email) ) {
            return true;
        }
    }
    return false;
}

This works, but I start to wonder: if you get a real-case situation, like Facebook's DB in which you have almost 3 billion registers, how do you deal with this scenario? Am I overconcerning, or there's really no chance this can keep being done efficiently?

CodePudding user response：

you don't need to use email as primary key, you can still search for existing users with given email address using either HQL/SQL @Query in your spring data @Repository annotated interface or by composing a method name such as findUserByEmail without writing query (if entity is named user and there is a field email spring data will fetch the data from db for you).
It's much more efficient to select 1 record and not transfer through hundreds/thousands of users every time someone needs to create a new account.

CodePudding user response：

The method findAll() is indeed inefficient to assert an unprecedented case in the database. To understand why it is so, we have to understand in terms of performance.

Performance of any application has many facets to it. But lets tone it down to Time and Space for now.

Issue in the implementation

Loop (Time): The implementation has a condition check inside the loop. So the condition is validated till it finds a match. If match is not found it loops till the list is exhausted. So, best case O(1) but it can reach to O(N) for no match. It can increase further if we have multiple condition checks.
Database (Space): After the database call all the records are kept in memory till we are processing. Fetching only the email ids from database could have helped but here we have the complete Account data stored in memory.
Query (Time & Space): The code is fetching records with the query whenever anyone tries to register. Let say if 100K register everyday, then almost similar data is fetched every time.

Improvements

Query for Unique Emails Ids: Query should fetch only unique email Ids. By this we will save space as well as time. Less space because only Email id is saved rather than whole Account data. For this you can use either JPA or native query.
Repository layer

    String findByEmailId(String emailId);

    @Query(nativeQuery = true, value="select email_Id from account a where a.email_Id = :emailId")
    String findEmailId(String emailId);

But above depends on database performance too.

Cache: Database operations are costly. It always good to maintain a Cache. Rather than fetching data from DB every time. Cache should have a refresh time and should be updated when a new record is inserted in DB. Cache operation is O(1).
Unique Constraint: It's not always optimal to perform validation in code because we have to query all the records and loop on them to validate. This uses a lot of network traffic and space. So, its good to have a unique constraint, because Database can check and return response very fast whenever constraint is violated.
Loop a list or Map: If our use case requires us to perform a loop, then its better to loop in List or Map because O(1) time complexity is for both list.indexOf(obj) and map.get(obj).

    List<String> emailIdList = accountRepository.findAllByEmailId();
    if(emailIdList.indexOf(email)> 0) return true;

Database Type: If it is mandatory to fetch data every time from Db, then its good to query from a No-Sql db as they have very fast retrieval time. The data is stored based on index. Many have Tree type data structure for storage.