How is TreeSize Free so fast at listing folder sizes?


The program TreeSize Free can list the folders inside a directory and sort them in descending order by size to find the biggest folders/files for cleaning up your hard drive. I'm wondering how it does this so fast. Assuming I wanted to calculate a folder size efficiently in C++, I would for example use the following modern C++ code:

#include <filesystem>

size_t calculate_directory_size(const std::filesystem::path& directory, const size_t maximum_size)
{
    size_t size = 0;
    for (std::filesystem::recursive_directory_iterator directory_iterator(directory);
        directory_iterator != std::filesystem::recursive_directory_iterator(); ++directory_iterator)
    {
        if (!is_directory(*directory_iterator))
        {
            size += file_size(*directory_iterator);
        }

        if (size > maximum_size)
        {
            // Save time
            return size;
        }
    }

    return size;
}

However, running this code, even in optimized builds, is significantly slower than what TreeSize can do (e.g. at least a factor of 3-4 slower). Is there any trick to iterating and summing up file sizes faster than my implementation does? I don't think that "clever multi-threading" will provide such a huge benefit since disk accesses can't really be multi-threaded for a huge performance gain.

CodePudding user response:

Multithreading

disk accesses can't really be multi-threaded for a huge performance gain

This is not true in general. It mainly depends on the hardware and the operating system (OS) stack (file system, drivers, the OS itself, etc.).

This is generally true for hard disk drives (HDDs): they are inherently sequential, mainly due to the magnetic head and the spinning disk. However, a good OS stack can cleverly prioritize the operations based on the real-time location of the head and the location of the blocks to be fetched. Still, HDD speed is mostly bound by the huge seek time, and the hierarchy of the searched files is almost never contiguous on most modern file systems (though there is a cache to avoid many fetches).

For solid-state drives (SSDs), this is more complex: the time to fetch a block is far smaller, but there is still a significant latency. Requesting multiple files asynchronously can be much faster than a synchronous loop that waits for each block to be received before requesting a new one. Modern NVMe SSDs can achieve hundreds of thousands of IO requests per second, so asynchronous operations are critical. Using multiple threads is one way to make things more asynchronous, though it is generally not very efficient.
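
As an illustration of this point, here is a minimal sketch (not TreeSize's actual implementation) of how one might keep several IO requests in flight from user code: each top-level sub-directory is scanned by its own std::async task and the partial sums are accumulated at the end. The helper scan_subtree and the overall structure are assumptions for the example.

#include <cstddef>
#include <filesystem>
#include <future>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: sums the file sizes of one subtree, single-threaded.
// std::error_code is used so inaccessible entries do not throw.
size_t scan_subtree(const fs::path& root)
{
    size_t size = 0;
    std::error_code ec;
    for (fs::recursive_directory_iterator it(root, ec);
         it != fs::recursive_directory_iterator(); it.increment(ec))
    {
        if (!ec && it->is_regular_file(ec))
            size += it->file_size(ec);
    }
    return size;
}

// One way to keep multiple IO requests in flight: scan each top-level
// sub-directory on its own task and sum the partial results.
size_t calculate_directory_size_parallel(const fs::path& directory)
{
    std::vector<std::future<size_t>> tasks;
    std::error_code ec;
    size_t size = 0;

    for (const fs::directory_entry& entry : fs::directory_iterator(directory, ec))
    {
        if (entry.is_directory(ec))
            tasks.push_back(std::async(std::launch::async, scan_subtree, entry.path()));
        else if (entry.is_regular_file(ec))
            size += entry.file_size(ec);
    }

    for (std::future<size_t>& task : tasks)
        size += task.get();  // wait for and accumulate the partial sums

    return size;
}

Whether such a scheme helps depends on how far the asynchronism reaches down the OS stack: as discussed above, it mostly pays off on SSDs and far less on HDDs.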

TreeSize uses multiple threads, which makes the computation faster on my machine with an NVMe SSD (Samsung 970 EVO Plus) and an i5-9600KF processor. Here are the (approximate) timings for the C:\Windows directory:

1 core:  11.0 s
2 cores:  9.0 s
6 cores:  7.5 s

The timings were produced by tuning the affinity of the threads to a fixed number of cores. Using multiple threads is not a silver bullet, but it is still significantly better than doing the operation serially on some platforms, at least for the TreeSize code.

Note that the profiling information shows that only 3 TreeSize threads are actually active simultaneously during a directory scan. One of them is clearly less active and seems to manage all the (GUI) events, while the other two do the IO operations. This can also explain why the operation does not scale well.


Performance of the C++ standard library

Even using 1 core, there is a big performance gap between TreeSize and your C++ code. Indeed, on my machine, the former takes 11 seconds while the latter takes 46 seconds using the GNU C++ compiler.

A low-level profiling analysis shows that most of the time of your C++ code is spent in 7 functions:

Time | Function name
--------------------------------------------------------------------------
 28% | std::filesystem::status
 25% | std::filesystem::__cxx11::recursive_directory_iterator::operator++
 20% | std::filesystem::file_size
 11% | GetFileAttributesW
  5% | _wfindfirst64
  3% | _wfindnext64
  2% | _findclose
 ... | ...

Based on the profiling information, about 75% of the time is spent in the libstdc++ library and not in the kernel. I do not really know why, since the profiler does not have access to the compiled code of the libstdc++ library used here. That being said, this clearly does not seem reasonable: GetFileAttributesW should not be needed for this use-case, because _wfindfirst64 and _wfindnext64 already provide the file size and the file name. This implementation of recursive_directory_iterator is simply inefficient. However, this may not be the case for all C++ standard library implementations.
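
A hedged illustration of that last point: the directory_entry returned by the iterator is allowed (but not required) to carry attributes cached during the scan, so calling its member functions instead of the free functions status/is_directory/file_size may avoid some of the redundant queries, depending on the standard library implementation. The following is a sketch of the question's loop rewritten that way, not a guaranteed fix for libstdc++:

#include <cstddef>
#include <filesystem>

namespace fs = std::filesystem;

size_t calculate_directory_size_cached(const fs::path& directory, const size_t maximum_size)
{
    size_t size = 0;
    for (fs::recursive_directory_iterator it(directory);
         it != fs::recursive_directory_iterator(); ++it)
    {
        // The entry may reuse attributes gathered while iterating, instead
        // of issuing a separate status()/GetFileAttributesW call per file.
        const fs::directory_entry& entry = *it;

        if (entry.is_regular_file())
            size += entry.file_size();

        if (size > maximum_size)
            return size;  // save time, as in the original code
    }
    return size;
}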


Fast implementation for Windows

One can write basic code directly using the Win32 API, more specifically the FindFirstFileW and FindNextFileW calls:

#include <windows.h>

#include <cwchar>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

size_t calculate_directory_size_win32(const fs::path& directory, const size_t maximum_size)
{
    size_t size = 0;
    WIN32_FIND_DATAW infos;
    std::vector<std::wstring> paths_to_scan;    // explicit stack instead of recursion

    paths_to_scan.push_back(directory.wstring());

    while(not paths_to_scan.empty())
    {
        std::wstring current_path = std::move(paths_to_scan.back());
        paths_to_scan.pop_back();

        HANDLE hFind = FindFirstFileW((current_path + L"\\*").c_str(), &infos);

        if(hFind != INVALID_HANDLE_VALUE)
        {
            do
            {
                if (infos.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
                {
                    // Skip the "." and ".." pseudo-entries, queue real sub-directories
                    if(wcscmp(infos.cFileName, L".") != 0 && wcscmp(infos.cFileName, L"..") != 0)
                        paths_to_scan.push_back(current_path + L'\\' + infos.cFileName);
                }
                else
                {
                    // Rebuild the 64-bit file size from its two 32-bit halves
                    size += (size_t(infos.nFileSizeHigh) << 32) | infos.nFileSizeLow;
                }

                if (size > maximum_size)
                {
                    FindClose(hFind);  // do not leak the search handle on early exit
                    return size;
                }
            }
            while(FindNextFileW(hFind, &infos) != 0);

            FindClose(hFind);
        }
    }

    return size;
}

The above code handles basic directories (one may need additional checks for special entries such as symlinks), and it is much faster on my machine: it only takes 8 seconds.
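
As a sketch of such an additional check (an assumption, not part of the code above): on NTFS, symbolic links, junctions and volume mount points are flagged as reparse points, and they can be detected directly from the WIN32_FIND_DATAW attributes:

#include <windows.h>

// Hypothetical helper for the do/while loop above: returns true for NTFS
// reparse points (symbolic links, junctions, volume mount points), which
// one may want to skip so the scan neither leaves the volume nor follows
// a cyclic link.
static bool is_reparse_point(const WIN32_FIND_DATAW& infos)
{
    return (infos.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0;
}

Inside the loop, one could then `continue` when is_reparse_point(infos) is true, before the FILE_ATTRIBUTE_DIRECTORY branch, so such entries are neither queued nor counted.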

When it comes to TreeSize, most of its time is spent in CreateFileW and CloseHandle when a refresh is done. This is a bit surprising, unless they only update the size of each file when needed, based on a file-tree cache saved somewhere.

Note that in all the benchmarks, the operations are repeated multiple times, so most of the IO operations do not actually reach the SSD, thanks to a relatively fast OS cache.
