Why doesn't the Linux Kernel use huge pages?

While browsing I came across a feature called huge pages. The huge page mechanism makes it possible to map 2M and even 1G pages using entries in the second- and third-level page tables, and as the kernel docs themselves say:

Usage of huge pages significantly reduces pressure on TLB, improves TLB hit-rate and thus improves overall system performance.

I browsed the kernel source as well and didn't see any usage of MAP_HUGETLB when it comes to mmap. In fact, /proc/sys/vm/nr_hugepages is set to 0 by default. Why is that? Does it mean the kernel has no need for huge pages at all? What are some examples of scenarios where huge pages are a must?

For the sake of example:

hugepage = mmap(NULL, getpagesize() * 4, PROT_WRITE | PROT_READ,
                MAP_ANON | MAP_HUGETLB | MAP_PRIVATE, -1, 0);

CodePudding user response：

The Linux kernel's approach to huge pages is mainly to let system administrators manage them from userspace. This is mostly because, as cool as they might sound, huge pages also have drawbacks: for example, they cannot be swapped to disk. This LWN series on huge pages gives a lot of information on the topic.

By default there are no huge pages reserved, and one can reserve them at boot time through the boot parameters hugepagesz= and hugepages= (specified multiple times for multiple huge page sizes). Huge pages can also be reserved at runtime through /proc/sys/vm/nr_hugepages and /sys/kernel/mm/hugepages/hugepages-*/nr_hugepages. Furthermore, they can be "dynamically" reserved by the kernel if .../nr_overcommit_hugepages is set higher than .../nr_hugepages. These numbers are reflected in /proc/meminfo under the various HugePages_XXX stats, which are for the default huge page size (Hugepagesize).
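To see what is currently reserved, the HugePages_XXX counters can be read back from /proc/meminfo. The following is a minimal sketch; the helper names meminfo_value and read_text_file are made up for illustration, and it assumes the usual "Key:   value" line layout of that file.

```c
#include <stdio.h>
#include <string.h>

/* Find a line starting with "key:" in /proc/meminfo-style text and store its
 * numeric value in *out. Returns 0 on success, -1 if the key is missing. */
int meminfo_value(const char *text, const char *key, long *out)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return sscanf(line + klen + 1, "%ld", out) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;            /* step past the newline to the next line */
    }
    return -1;
}

/* Read a small text file such as /proc/meminfo into buf (NUL-terminated). */
int read_text_file(const char *path, char *buf, size_t cap)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, cap - 1, f);
    buf[n] = '\0';
    fclose(f);
    return 0;
}
```

A caller would typically read /proc/meminfo into a buffer of a few KB with read_text_file and then query keys such as HugePages_Total, HugePages_Free or Hugepagesize. On kernels built without hugetlb support those keys may simply be absent, in which case meminfo_value returns -1.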

File-backed mappings only support huge pages if the file resides in a hugetlbfs filesystem, and only of the specific size specified at mount time (mount option pagesize=). The hugeadm command-line tool, among other things, can give info about currently mounted hugetlbfs filesystems with --list-all-mounts. One major reason for wanting a hugetlbfs mounted on your system is to enable huge page support in QEMU/libvirt guests.

All of the above covers "voluntary" huge page allocations, i.e. those explicitly requested with MAP_HUGETLB.
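Since nr_hugepages is 0 by default, a program that asks for MAP_HUGETLB should expect the request to fail and have a fallback. Here is a minimal sketch of that pattern. It assumes a 2M default huge page size (typical on x86-64, but check Hugepagesize in /proc/meminfo), and the names alloc_prefer_huge and round_up_to are hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define HUGE_2M (2UL * 1024 * 1024)   /* assumed default huge page size */

/* Round len up to a multiple of align (align must be a power of two). */
size_t round_up_to(size_t len, size_t align)
{
    return (len + align - 1) & ~(align - 1);
}

/* Try an anonymous MAP_HUGETLB mapping first; if no huge pages are available,
 * fall back to normal pages. *used_huge reports which path succeeded and
 * *mapped_len is the length to pass to munmap() later. */
void *alloc_prefer_huge(size_t len, size_t *mapped_len, int *used_huge)
{
    size_t huge_len = round_up_to(len, HUGE_2M);
    void *p = mmap(NULL, huge_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        *mapped_len = huge_len;
        *used_huge = 1;
        return p;
    }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *mapped_len = len;
    *used_huge = 0;
    return p;
}
```

On a default system the first mmap will usually fail with ENOMEM and the function returns ordinary pages. After something like "echo 8 > /proc/sys/vm/nr_hugepages" (as root) the huge path should start succeeding.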

Linux also supports transparent huge pages (THP). Normal pages can be transparently made huge (or, vice versa, existing transparent huge pages can be broken into normal pages) when the kernel sees fit. This happens without the need for MAP_HUGETLB, and regardless of the nr_hugepages setting.

There are some sysfs knobs to control THPs too. The most notable one is /sys/kernel/mm/transparent_hugepage/enabled: always means that the kernel will try to create THPs even without userspace programs actively suggesting it; madvise means that it will do so only if userspace programs suggest it through madvise(addr, len, MADV_HUGEPAGE); never means they are disabled. The default depends on the distribution's kernel configuration: you'll probably see it set to always or madvise in modern Linux distros, e.g. recent releases of Debian or Ubuntu.
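The enabled file lists all three modes on one line and marks the active one with square brackets, e.g. "always [madvise] never". A small sketch for extracting the active mode follows; the function names thp_active_mode and thp_read_mode are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Copy the bracketed word from a line such as "always [madvise] never" into
 * out. Returns 0 on success, -1 if no bracketed word fits in out. */
int thp_active_mode(const char *line, char *out, size_t cap)
{
    const char *lb = strchr(line, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    if (!lb || !rb || cap == 0)
        return -1;

    size_t n = (size_t)(rb - lb - 1);
    if (n >= cap)
        return -1;
    memcpy(out, lb + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the live setting; fails if the file is missing, which is the case on
 * kernels built without THP support. */
int thp_read_mode(char *out, size_t cap)
{
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (!f)
        return -1;
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    return ok ? thp_active_mode(line, out, cap) : -1;
}
```

Calling thp_read_mode(buf, sizeof buf) at startup is one way for a program to decide whether it needs to issue madvise(..., MADV_HUGEPAGE) itself or can rely on the always policy.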

As an example, doing mmap(0x123 << 21, 2*1024*1024, 7, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) (7 being PROT_READ|PROT_WRITE|PROT_EXEC) with /sys/kernel/mm/transparent_hugepage/enabled set to always should result in a 2M transparent huge page once the memory is first touched, provided the kernel honors the address hint, since the requested mapping is then aligned to 2M (notice the absence of MAP_HUGETLB).
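When the policy is madvise rather than always, the program has to opt in, and it cannot rely on an address hint being honored. One common approach, sketched here under the assumption of a 2M THP size, is to over-allocate by one huge page, trim to a 2M-aligned window, and then call madvise. The name map_thp_candidate is made up for this example.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define THP_SIZE (2UL * 1024 * 1024)  /* assumed PMD-sized THP (x86-64) */

/* Map len bytes (expected to be a multiple of 2M) at a 2M-aligned address and
 * hint that the range should be backed by THPs. Over-allocates by one huge
 * page, then unmaps the unaligned head and the unused tail.
 * Returns NULL on failure; *advised is 1 if madvise(MADV_HUGEPAGE) succeeded. */
void *map_thp_candidate(size_t len, int *advised)
{
    size_t span = len + THP_SIZE;
    char *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t start = ((uintptr_t)raw + THP_SIZE - 1) & ~(uintptr_t)(THP_SIZE - 1);
    char *aligned = (char *)start;
    size_t head = (size_t)(aligned - raw);
    size_t tail = span - head - len;

    if (head)
        munmap(raw, head);
    if (tail)
        munmap(aligned + len, tail);

    /* May fail (e.g. EINVAL) on kernels without THP; the mapping stays usable. */
    *advised = (madvise(aligned, len, MADV_HUGEPAGE) == 0);
    return aligned;
}
```

Whether a huge page was actually used is only decided when the memory is touched. The AnonHugePages field of the mapping in /proc/self/smaps shows how much of it ended up backed by THPs.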

Does it mean the kernel has no need for huge pages at all? What are some examples of scenarios where huge pages are a must?

In general, you don't really need huge pages of any kind; you can very well live without them. They are just an optimization. Scenarios where they can be useful are, as mentioned by @Mgetz in the comments above, cases where you have a lot of random memory accesses on very large files (common for databases). Minimizing TLB pressure in such cases can result in significant performance improvements.

CodePudding user response：

One place the kernel uses large pages is when copying the map of kernel pages into the user process map. See pti_clone_kernel_text. It uses PMD-sized entries (2MB) where it can and PTEs (4K) for the rest. For a 10MB kernel, this means the kernel map takes only a small number of entries: on the order of five 2MB entries rather than 2560 4K ones.
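The saving can be made concrete with a toy model of "PMD where possible, PTE for the rest". This is not the kernel's code, only an illustration of the arithmetic; count_entries and struct map_cost are hypothetical names, and the sizes assume x86-64.

```c
#include <stdint.h>

#define PTE_SIZE (4UL * 1024)            /* 4K base page */
#define PMD_SIZE (2UL * 1024 * 1024)     /* 2M huge page on x86-64 */

struct map_cost {
    unsigned long pmd_entries;   /* number of 2M mappings */
    unsigned long pte_entries;   /* number of 4K mappings */
};

/* Count the entries needed to map [start, end): every 2M-aligned, fully
 * covered chunk takes one PMD entry; the unaligned head and tail fall back
 * to 4K PTEs. start and end are assumed to be 4K-aligned. */
struct map_cost count_entries(uint64_t start, uint64_t end)
{
    struct map_cost c = {0, 0};
    uint64_t addr = start;

    while (addr < end) {
        if ((addr % PMD_SIZE) == 0 && end - addr >= PMD_SIZE) {
            c.pmd_entries++;
            addr += PMD_SIZE;
        } else {
            c.pte_entries++;
            addr += PTE_SIZE;
        }
    }
    return c;
}
```

For a 2M-aligned 10MB range this model yields 5 PMD entries and no PTEs, whereas mapping the same range with 4K pages alone would need 2560 entries.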