Memory Mapping on Windows (including Benchmark)
The following tweet sparked my interest in investigating the different ways of mapping and unmapping memory on Windows, and in trying to find the “best” one:
Easy gamedev improvement for a frame allocator: Uses two virtual address ranges. At the end of the frame, decommit (unmap) physical memory of the range that was first used, switch to other virtual address range, and then commit (map) physical memory to it. 1/2
— Mickael Gilabert (@mickaelgilabert) March 13, 2018
If you’re familiar with the memory mapping techniques on Windows, you may skip to the benchmark and conclusion.
Mapping Memory
As a quick recap, Windows has multiple ways of mapping (allocating) virtual memory from the operating system: `VirtualAlloc` or `CreateFileMapping`.
`VirtualAlloc` is pretty straightforward to use: you pass in the desired address (or `NULL`, if you are fine with the decision of the OS), the size (ideally a multiple of `dwAllocationGranularity`, since reservations are aligned to it), and the flags, and you get back the address of the newly allocated memory region.
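As a minimal sketch of the call described above (not the benchmark code): the helper `round_up` and the request size of 1,000,000 bytes are made up for illustration, and the Windows-only part is guarded so the helper compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round `size` up to the next multiple of
   `granularity`, which must be a power of two. */
static size_t round_up(size_t size, size_t granularity)
{
    assert(granularity != 0 && (granularity & (granularity - 1)) == 0);
    return (size + granularity - 1) & ~(granularity - 1);
}

#ifdef _WIN32 /* everything below needs the Win32 API */
#include <windows.h>

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    /* Illustrative request size, rounded up to the allocation granularity. */
    size_t size = round_up(1000000, si.dwAllocationGranularity);

    /* NULL lets the OS pick the address; reserve and commit in one call. */
    char *p = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    assert(p != NULL);

    p[0] = 1; /* the first write makes the OS back this page */

    /* MEM_RELEASE takes the original base address and a size of 0. */
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}
#endif
```

Passing `MEM_RESERVE | MEM_COMMIT` together is one possible choice; reserving first and committing later in smaller pieces works with the same function.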
`CreateFileMapping` is a little more complicated, since its actual purpose is to map files on disk into memory. However, if you pass in a size and `INVALID_HANDLE_VALUE` instead of a valid file handle, you’ll get a file mapping that is backed by the system’s page file. The resulting mapping handle then has to be mapped into the address space with `MapViewOfFile`.
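A page-file-backed mapping could look roughly like the sketch below. The helpers `size_high` and `size_low` are made up for illustration (the API takes the size as two 32-bit halves), and the 300 MiB size is just an example value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: CreateFileMapping takes the maximum size
   split into a high and a low 32-bit half. */
static uint32_t size_high(uint64_t size) { return (uint32_t)(size >> 32); }
static uint32_t size_low(uint64_t size)  { return (uint32_t)(size & 0xFFFFFFFFu); }

#ifdef _WIN32 /* everything below needs the Win32 API */
#include <windows.h>

int main(void)
{
    const uint64_t size = 300ull * 1024 * 1024; /* example size: 300 MiB */

    /* INVALID_HANDLE_VALUE instead of a file handle: backed by the page file. */
    HANDLE mapping = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                        size_high(size), size_low(size), NULL);
    assert(mapping != NULL);

    /* Offset 0 and length 0 map the whole mapping object. */
    char *view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    assert(view != NULL);

    view[0] = 1; /* touch the first page */

    UnmapViewOfFile(view);
    CloseHandle(mapping);
    return 0;
}
#endif
```

Note that two objects are involved: the view, which is unmapped with `UnmapViewOfFile`, and the mapping handle, which is closed separately with `CloseHandle`.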
Since Windows, unlike Linux, does not support overcommitting, all committed memory, regardless of method, must always be backed by either the swap file [1] or physical memory.
Unmapping Memory
Now that we know how to map memory on Windows, let’s look into unmapping. Contrary to any logic, there are four, not just two, ways of unmapping:
- `VirtualFree`, with `MEM_RELEASE`
- `VirtualFree`, with `MEM_DECOMMIT`
- `VirtualAlloc`, using `MEM_RESET` & `MEM_RESET_UNDO`
- `UnmapViewOfFile`
Using `UnmapViewOfFile` is the only legal way to unmap a region mapped with `MapViewOfFile`; the first three ways are legal for any memory allocated using `VirtualAlloc`.
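The three `VirtualAlloc`-side variants can be sketched as follows. The enums and the helpers `state_after` and `query_state` are made up for illustration; they record the state I would expect the address range to be in after each call, and the Windows-only part checks that expectation with `VirtualQuery`. This is a sketch under those assumptions, not the benchmark code.

```c
#include <assert.h>
#include <stddef.h>

enum unmap_method { UNMAP_RELEASE, UNMAP_DECOMMIT, UNMAP_RESET };
enum range_state  { RANGE_FREE, RANGE_RESERVED, RANGE_COMMITTED };

/* Expected state of the address range after each way of "unmapping". */
static enum range_state state_after(enum unmap_method m)
{
    switch (m) {
    case UNMAP_RELEASE:  return RANGE_FREE;      /* address range is given back */
    case UNMAP_DECOMMIT: return RANGE_RESERVED;  /* addresses kept, backing dropped */
    case UNMAP_RESET:    return RANGE_COMMITTED; /* still usable, contents may be lost */
    }
    return RANGE_FREE;
}

#ifdef _WIN32 /* everything below needs the Win32 API */
#include <windows.h>

/* Ask the kernel which state the page at `p` is in. */
static enum range_state query_state(void *p)
{
    MEMORY_BASIC_INFORMATION mbi;
    SIZE_T n = VirtualQuery(p, &mbi, sizeof mbi);
    assert(n != 0);
    if (mbi.State == MEM_COMMIT)  return RANGE_COMMITTED;
    if (mbi.State == MEM_RESERVE) return RANGE_RESERVED;
    return RANGE_FREE;
}

int main(void)
{
    const SIZE_T size = 64 * 1024;
    char *p = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    assert(p != NULL);
    p[0] = 1;

    /* MEM_RESET: the protection flag is ignored but must be a valid value. */
    VirtualAlloc(p, size, MEM_RESET, PAGE_NOACCESS);
    assert(query_state(p) == state_after(UNMAP_RESET));

    /* MEM_RESET_UNDO tries to reclaim the data; NULL means it is gone. */
    VirtualAlloc(p, size, MEM_RESET_UNDO, PAGE_NOACCESS);

    /* MEM_DECOMMIT: takes the size of the part to decommit. */
    VirtualFree(p, size, MEM_DECOMMIT);
    assert(query_state(p) == state_after(UNMAP_DECOMMIT));

    /* MEM_RELEASE: takes the original base address and a size of 0. */
    VirtualFree(p, 0, MEM_RELEASE);
    assert(query_state(p) == state_after(UNMAP_RELEASE));
    return 0;
}
#endif
```

The fourth way, `UnmapViewOfFile`, only applies to views created with `MapViewOfFile` and takes nothing but the base address of the view.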
Benchmark
One might think that they all behave more or less the same: each of them should technically only change a couple of bits in some data structure in the kernel. Apparently, this is not the case.
Every benchmark consists of mapping the region, overwriting it (using either `memset` or touching the first byte of every 64k region, more on that later) to make sure the memory is actually committed to physical RAM, and then unmapping the region again.
The cost of the pure `memset` operation is also benchmarked, and subtracted from all benchmarks.
Benchmarks were run on my desktop machine at home (i7 5960X, 64 GiB DDR4-2134 RAM, Samsung 960 Pro). The size of the memory region used was 300 MiB (64k * 4800).
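To make the setup concrete, here is a small portable sketch of the two "touch" variants and of the normalization step. All names (`region_size`, `touch_all`, `touch_first_bytes`, `normalized_ms`) and the fill byte are made up for illustration; the actual benchmark code may differ. Assuming 4 KiB pages, the second variant writes to only one page out of every sixteen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { CHUNK_SIZE = 64 * 1024, CHUNK_COUNT = 4800 };

/* 64 KiB * 4800 = 300 MiB */
static size_t region_size(void)
{
    return (size_t)CHUNK_SIZE * CHUNK_COUNT;
}

/* Variant 1: overwrite the whole region. */
static void touch_all(unsigned char *base, size_t size)
{
    assert(base != NULL);
    memset(base, 0xAB, size);
}

/* Variant 2: write only the first byte of every 64 KiB chunk.
   Returns the number of chunks touched. */
static size_t touch_first_bytes(unsigned char *base, size_t size)
{
    size_t touched = 0;
    assert(base != NULL);
    for (size_t off = 0; off < size; off += CHUNK_SIZE) {
        base[off] = 0xAB;
        ++touched;
    }
    return touched;
}

/* Subtract the cost of the pure touch/memset run from a measurement. */
static double normalized_ms(double measured_ms, double memset_only_ms)
{
    return measured_ms - memset_only_ms;
}
```

In a benchmark loop, `base` would be the pointer returned by `VirtualAlloc` or `MapViewOfFile` and `size` would be `region_size()`.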
Memsetting the whole allocated region
Benchmark | Time |
---|---|
memset | 21.9ms |
VirtualAlloc / VirtualFree | 64.4ms |
MEM_DECOMMIT | 116.8ms |
MEM_RESET | 77.3ms |
UnmapViewOfFile | 103.6ms |
(the “unnormalized” bar includes the cost of the memset)
Only touching the first byte of every 64k region
Benchmark | Time |
---|---|
“memset” | 0.1ms |
VirtualAlloc / VirtualFree | 4.4ms |
MEM_DECOMMIT | 4.6ms |
MEM_RESET | 4.4ms |
UnmapViewOfFile | 5.6ms |
(the “unnormalized” bar includes the cost of the memset)
The code for the benchmark can be found here.
Further Investigation
Further investigation revealed that `MEM_RESET` unmaps the pages lazily [2], dropping them only under memory pressure, while the other ways actively unmap (and probably zero) the memory. This would explain the difference in perceived performance.
Releasing the memory will try to “hide” the cost of zero’ing, as explained by this fantastic blog post by Bruce Dawson.
Conclusion
If the intention is to re-use the pages in the near future, prefer to mark them as unused using `MEM_RESET`. Otherwise, simply releasing the pages is best, and gives Windows a better opportunity to re-use them.
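For the re-use case, a frame allocator along these lines might look like the following sketch. The `frame_arena` struct, `arena_alloc`, `arena_end_frame`, the 16-byte alignment and the 1 MiB size are all made up for illustration; only the `MEM_RESET` call at the end of each frame is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bump allocator state for one frame. */
struct frame_arena {
    unsigned char *base;
    size_t size;
    size_t used;
};

/* Hand out `n` bytes, rounded up to 16; NULL when the arena is full. */
static void *arena_alloc(struct frame_arena *a, size_t n)
{
    if (n > a->size) return NULL;          /* also guards the rounding below */
    n = (n + 15) & ~(size_t)15;
    if (n > a->size - a->used) return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

#ifdef _WIN32 /* everything below needs the Win32 API */
#include <windows.h>

/* End of frame: rewind, and tell Windows the old contents are disposable.
   The range stays committed and can be written to again right away. */
static void arena_end_frame(struct frame_arena *a)
{
    a->used = 0;
    VirtualAlloc(a->base, a->size, MEM_RESET, PAGE_NOACCESS);
}

int main(void)
{
    struct frame_arena a = { NULL, 16 * 64 * 1024, 0 };
    a.base = VirtualAlloc(NULL, a.size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    assert(a.base != NULL);

    for (int frame = 0; frame < 3; ++frame) {
        int *scratch = arena_alloc(&a, 1000 * sizeof *scratch);
        assert(scratch != NULL);
        scratch[0] = frame;                /* per-frame scratch data */
        arena_end_frame(&a);
    }

    VirtualFree(a.base, 0, MEM_RELEASE);
    return 0;
}
#endif
```

Since the contents are undefined after `MEM_RESET`, this only fits data that is rebuilt from scratch every frame.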
In general, though, I’d advise against any of these methods, since their performance characteristics are not suited for anything close to (soft) real-time.
I’ve yet to figure out a fast way. Any ideas?
Tweet me at @ArvidGerstmann.
[1] Page file in Windows jargon.
[2] As revealed by a developer on jemalloc: https://github.com/jemalloc/jemalloc/issues/255#issuecomment-130380103