Intel & AMD Micro-Architecture Extended Instruction Sets

The following tables list the micro-architectures in which certain instruction sets were first introduced. The "instr. set" column only lists the newly introduced instruction sets, not every instruction set the architecture supports.


Only includes consumer CPUs, not Xeons or other prosumer hardware.

Intel

Year  uArch         instr. set
2006  Intel Core    SSE, SSE2, SSE3, SSSE3
2007  Penryn        SSE4.1, VT-x, VT-d
2008  Nehalem       SSE4.2
2010  Westmere      AES-NI, CLMUL
2011  Sandy Bridge  AVX, TXT
2012  Ivy Bridge    F16C
2013  Haswell       FMA3, AVX2, TSX (only Haswell-EX)
2015  Skylake       MPX, SGX, HEVC
2016  Kaby Lake     -
2017  Coffee Lake   -
2018  Cannon Lake   AVX-512, SHA
2018  Cascade Lake  TBD
2018  Whiskey Lake  TBD
2019  Ice Lake      TBD


AMD

Year  uArch              instr. set
2003  Hammer (K8)        SSE, SSE2 (SSE3, starting with Athlon64)
2007  K10                AMD-V, SSE4a
2011  Bobcat (K14)       ABM
2011  Bulldozer (K15)    SSE4.1, SSE4.2, AES, CLMUL, AVX, XOP, FMA4, F16C
2012  Piledriver (K15)   FMA3
2012  Steamroller (K15)  HEVC
2013  Jaguar (K16)       MOVBE
2017  Zen (K17)          AVX2, SHA, ADX, RDSEED
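
Since the tables only say when an instruction set first appeared, a program usually still has to check at run time whether the CPU it runs on supports it. The following is a minimal sketch of such a check, assuming GCC or clang on an x86 target, where the __builtin_cpu_supports builtin is available. The struct and function names are made up for this example, and only a handful of the instruction sets listed above are queried.

```c
#include <stdio.h>

/* A small subset of the instruction sets from the tables above.
 * The struct and its field names are hypothetical. */
struct cpu_features {
    int sse4_2;
    int avx;
    int avx2;
    int avx512f;
};

/* Queries the CPU the program is currently running on. On non-x86
 * targets, or with other compilers, every field simply stays 0. */
struct cpu_features detect_cpu_features(void)
{
    struct cpu_features f = {0, 0, 0, 0};
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init(); /* harmless if it has already been called */
    f.sse4_2  = __builtin_cpu_supports("sse4.2")  != 0;
    f.avx     = __builtin_cpu_supports("avx")     != 0;
    f.avx2    = __builtin_cpu_supports("avx2")    != 0;
    f.avx512f = __builtin_cpu_supports("avx512f") != 0;
#endif
    return f;
}

/* Prints one line per queried instruction set. */
void print_cpu_features(const struct cpu_features *f)
{
    printf("SSE4.2:   %s\n", f->sse4_2  ? "yes" : "no");
    printf("AVX:      %s\n", f->avx     ? "yes" : "no");
    printf("AVX2:     %s\n", f->avx2    ? "yes" : "no");
    printf("AVX-512F: %s\n", f->avx512f ? "yes" : "no");
}
```

Calling detect_cpu_features() once at startup and passing the result to print_cpu_features() is enough to see which of these the current machine offers. The output depends on the machine, so none is shown here.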

If you see any errors, please contact me on Twitter @ArvidGerstmann.


PDBs on Linux

On Windows, debug symbols aren't stored alongside the executable code in the same file. They're stored in a .pdb file (short for "program database"). This is especially great if you distribute your program to end users but still want to be able to debug any crashes. Just keep the .pdb file somewhere safe, and any crash log you are sent can easily be translated back into source locations.

On Linux, debug symbols are traditionally stored inside the executable and stripped (using strip(1)) before distribution. This takes away the ability to debug any crash that is reported to you from the stripped executable.

Today I discovered a neat little trick to create something resembling PDBs on Linux by making use of objcopy(1). In our example, I have already compiled an executable a.out, which I want to distribute to my users:

$ ls -lah a.out
-rwxr-xr-x  1 woot woot 30K May  5 11:26 a.out 

As we can see, the executable, with debug symbols, has a size of 30k.

Now we extract the debug symbols into another file, using objcopy(1):

$ objcopy --only-keep-debug a.out a.out.pdb
$ strip a.out

As we can now see, our debug symbols are extracted and removed from a.out:

$ ls -lah .
-rwxr-xr-x  1 woot woot 6.3K May  5 11:26 a.out
-rwxr-xr-x  1 woot woot  28K May  5 11:25 a.out.pdb

If we now try to debug a.out, however, gdb(1) tells us that the debug information is missing:

$ gdb a.out
Reading symbols from a.out...(no debugging symbols found)...done.

We need to attach the .pdb symbols to a.out. This can be achieved by adding a .gnu_debuglink section, a GNU feature which records the name of the separate debug file inside the executable:

$ objcopy --add-gnu-debuglink=a.out.pdb a.out

To confirm it's working, we start gdb(1), to see whether it can now pick up any symbols:

$ gdb a.out
Reading symbols from a.out...Reading symbols from /home/woot/tmp/pdbtest/a.out.pdb...done.

Lovely! Our a.out can now be distributed & debugged with external symbols.


To recap, the whole process boils down to three commands:

$ objcopy --only-keep-debug a.out a.out.pdb # extract symbols
$ strip a.out # strip away any debug information
$ objcopy --add-gnu-debuglink=a.out.pdb a.out # attach the symbols to the executable


Memory Mapping on Windows (including Benchmark)

A tweet sparked my interest in investigating the different ways of mapping and unmapping memory on Windows, and in trying to find the "best" one.

If you're familiar with the memory mapping techniques on Windows, you may skip to the benchmark and conclusion.

Mapping Memory

As a quick recap, Windows has two ways of mapping (allocating) virtual memory from the operating system: VirtualAlloc and CreateFileMapping.

VirtualAlloc is pretty straightforward to use. You simply pass in the desired address (or NULL, if you are fine with the decision of the OS), the size (as a multiple of dwAllocationGranularity) and the flags, and you get back the address of the newly allocated memory region.

CreateFileMapping is a little more complicated, since its actual purpose is to map files on disk into memory. However, if you pass in a size and INVALID_HANDLE_VALUE instead of a valid file handle, you'll get a file mapping which is backed by the system's page file. The resulting mapping handle then has to be mapped into the address space with MapViewOfFile.

Since Windows does not support overcommitting, unlike Linux, all allocated memory, regardless of method, must always be backed by either the swap file[1] or physical memory.
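
The two mapping routes described above can be sketched roughly as follows. This is only an illustration: the function names are made up, the protection and access flags (PAGE_READWRITE, FILE_MAP_ALL_ACCESS) are assumptions about what a typical caller would want, and error handling is kept to the bare minimum. The code only does something on Windows; the #ifdef merely lets the file compile elsewhere.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

/* Route 1: reserve and commit in a single call. Passing NULL as the
 * address lets the OS decide where the region ends up. */
static void *map_with_virtual_alloc(size_t size)
{
    return VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}

/* Route 2: a mapping backed by the page file (INVALID_HANDLE_VALUE in
 * place of a file handle), made addressable through MapViewOfFile.
 * The mapping handle is handed back through out_mapping. */
static void *map_with_file_mapping(size_t size, HANDLE *out_mapping)
{
    ULARGE_INTEGER bytes;
    HANDLE mapping;
    void *view;

    bytes.QuadPart = size;
    mapping = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                 bytes.HighPart, bytes.LowPart, NULL);
    if (mapping == NULL)
        return NULL;

    view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, size);
    if (view == NULL) {
        CloseHandle(mapping);
        return NULL;
    }
    *out_mapping = mapping;
    return view;
}
#endif

/* Maps `size` bytes through both routes, writes one byte into each region
 * and frees everything again. Returns 1 on success, 0 on failure, and -1
 * when not built for Windows (there is nothing to exercise there). */
int try_both_mappings(size_t size)
{
#ifdef _WIN32
    HANDLE mapping = NULL;
    unsigned char *a = map_with_virtual_alloc(size);
    unsigned char *b = map_with_file_mapping(size, &mapping);
    int ok = (a != NULL && b != NULL);

    if (ok) {
        a[0] = 1;
        b[0] = 1;
    }
    if (a != NULL)
        VirtualFree(a, 0, MEM_RELEASE);
    if (b != NULL) {
        UnmapViewOfFile(b);
        CloseHandle(mapping);
    }
    return ok;
#else
    (void)size;
    return -1;
#endif
}
```

Note that the second route needs two objects to be cleaned up, the view and the mapping handle, while the first one only has the region itself.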

Unmapping Memory

Now that we know how to map memory on Windows, let's look into unmapping. Contrary to any logic, there are four, not just two, ways of unmapping:

  • VirtualFree, with MEM_RELEASE
  • VirtualFree, with MEM_DECOMMIT
  • VirtualAlloc, using MEM_RESET & MEM_RESET_UNDO
  • UnmapViewOfFile

Using UnmapViewOfFile is the only legal way to unmap a region mapped with MapViewOfFile. The other three ways are legal for any memory allocated using VirtualAlloc.
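
For reference, here is a rough sketch of what the four ways look like in code. The function name and the order of the steps are chosen for this example only: MEM_RELEASE has to come last because the region is gone afterwards. MEM_RESET_UNDO is only available on Windows 8 and later, so its result is deliberately ignored here. As before, the code only does something on Windows.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Runs each of the four ways once on freshly mapped memory of `size`
 * bytes. Returns 1 on success, 0 on failure, and -1 when not built for
 * Windows. */
int try_all_unmappings(size_t size)
{
#ifdef _WIN32
    ULARGE_INTEGER bytes;
    HANDLE mapping;
    void *view;
    unsigned char *p;
    int ok = 1;

    p = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (p == NULL)
        return 0;
    p[0] = 1;

    /* MEM_RESET: the contents are declared uninteresting, but the pages
     * stay committed and the address range stays valid. */
    ok &= VirtualAlloc(p, size, MEM_RESET, PAGE_READWRITE) != NULL;

    /* MEM_RESET_UNDO ends the reset state before the range is reused.
     * A NULL result only means some of the data was already discarded. */
    (void)VirtualAlloc(p, size, MEM_RESET_UNDO, PAGE_READWRITE);
    p[0] = 2;

    /* MEM_DECOMMIT: the pages are dropped, the address range stays
     * reserved and could be committed again later. */
    ok &= VirtualFree(p, size, MEM_DECOMMIT) != 0;

    /* MEM_RELEASE: pages and address range are both given back. The size
     * must be 0 and the address must be the base of the allocation. */
    ok &= VirtualFree(p, 0, MEM_RELEASE) != 0;

    /* UnmapViewOfFile: the only option for views of a file mapping. */
    bytes.QuadPart = size;
    mapping = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                 bytes.HighPart, bytes.LowPart, NULL);
    if (mapping == NULL)
        return 0;
    view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, size);
    if (view != NULL)
        ok &= UnmapViewOfFile(view) != 0;
    else
        ok = 0;
    CloseHandle(mapping);

    return ok;
#else
    (void)size;
    return -1;
#endif
}
```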


Benchmark

One might think that all of these behave more or less the same. Each of them should technically only change a couple of bits in some data structure in the kernel. Apparently, this is not the case.

Every benchmark consists of mapping the region, overwriting it (either with memset or by touching only the first byte of every 64k chunk, more on that later) to make sure the memory is actually backed by physical RAM, and then unmapping the region again.
The cost of the pure memset operation is also benchmarked, and subtracted from all benchmarks.
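
The two ways of overwriting the region could look roughly like the sketch below. The function names, the REGION_SIZE macro and the 0xAB fill value are made up for illustration, and the actual benchmark code may well differ. The code is plain C and does not depend on how the memory was mapped.

```c
#include <stddef.h>
#include <string.h>

/* Size of one chunk, matching the 64k granularity used in the text. */
#define REGION_SIZE (64u * 1024u)

/* Variant 1: write every single byte of the region. */
void commit_by_memset(unsigned char *base, size_t size)
{
    memset(base, 0xAB, size);
}

/* Variant 2: write only the first byte of every 64k chunk.
 * Returns the number of chunks that were touched. */
size_t commit_by_touch(unsigned char *base, size_t size)
{
    /* volatile keeps the compiler from dropping the stores */
    volatile unsigned char *p = base;
    size_t touched = 0;
    size_t offset;

    for (offset = 0; offset < size; offset += REGION_SIZE) {
        p[offset] = 1;
        ++touched;
    }
    return touched;
}
```

For the 300 MiB region used below, commit_by_touch performs 4800 single-byte writes. Keep in mind that a single write only faults in the page containing that byte, not the whole 64k chunk, which is presumably a large part of why the second variant is so much cheaper.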

Benchmarks were run on my desktop machine at home (i7 5960X, 64 GiB DDR4-2134 RAM, Samsung 960 Pro). The size of the memory region used was 300 MiB (64k * 4800).

Memsetting the whole allocated region

Benchmark                 Time
memset                    21.9 ms
VirtualAlloc/VirtualFree  64.4 ms
MEM_RESET                 77.3 ms
UnmapViewOfFile           103.6 ms

(the "unnormalized" bar includes the cost of the memset)

Only touching the first byte of every 64k region

Benchmark                 Time
"memset"                  0.1 ms
VirtualAlloc/VirtualFree  4.4 ms
UnmapViewOfFile           5.6 ms

(the "unnormalized" bar includes the cost of the memset)

The code for the benchmark can be found here.

Further Investigation

Further investigation revealed that MEM_RESET unmaps the pages lazily[2], dropping them only under memory pressure, while the other ways actively unmap (and probably zero) the memory. This would explain the difference in perceived performance.
When memory is released, Windows tries to "hide" the cost of zeroing, as explained by this fantastic blog post by Bruce Dawson.


Conclusion

If the intention is to re-use the pages in the near future, prefer to mark them as unused using MEM_RESET. Otherwise, simply releasing the pages is best, and gives Windows a better opportunity to re-use them.
In general, though, I'd advise against any of these methods, since their performance characteristics are not suited for anything close to (soft) realtime.

I have yet to figure out a fast way. Any ideas?
Tweet me at @ArvidGerstmann.

  1. Page file in Windows jargon. ↩︎

  2. As revealed by a developer on jemalloc: ↩︎


Announcing the C++ Tour


I'm proud to officially announce the C++ Tour.[1]

The tour can be best explained by quoting our mission statement:

The goal of the C++ tour project is to create a new way of teaching C++.
First and foremost we want to target those, who already have some experience
in programming, but are new to C++ or return after a longer absence.

We want to guide through features of the language and standard library, showing
pitfalls and best practices. The tour will be split into chapters, each of which
contains lessons, teaching a single concept or language feature.
Every lesson will be accompanied by an interactive example, demonstrating the
concept and allowing for experimentation.

It'll be available from early next year (current content is a placeholder).

We are looking for help!

The tour is currently being built on,
we have a couple of tickets open looking for feedback.

Please give us a star and share the blog post!

Feel free to just chime in. We're looking for any help we can get to help make
the C++ tour a reality in a timely manner.

It's best to reach us over the Slack channel #cpp-tour on the CppLang slack (click here to join).

  1. The official announcement was done on CppCast and can be heard in Episode 129. ↩︎


Using clang on Windows

Update 1: Visual Studio 2017 works. Thanks to STL.

Disclaimer: This isn't about clang/C2. clang/C2 is Microsoft's own fork of clang, made to work with their backend. This post uses clang + llvm.

tl;dr: All the source is in this repository:

Recently, Chrome decided to switch their Windows builds to clang exclusively. That got me intrigued to try it again, since my earlier experience with clang on Windows was rather mixed. However, if it's good enough for Chrome, it surely must've improved!

Unfortunately, getting clang to compile MSVC based projects isn't as easy as just dropping in clang and changing a few flags. Let's get started.


You'll need:


Since I want to keep this independent of any build system, I've set up a .bat script with all the required steps to compile a simple example. You can grab it here:

Open the build.bat and let's walk through it:

  • Set LLVMPath, VSPath and WinSDKPath to the installation paths of LLVM, VS 2017 and the current Windows Kit.
  • OUTPUT defines the name of the final .exe.
  • CFLAGS contains all your usual clang compiler flags, for our example I've kept them simple.
  • CPPFLAGS defines the include directories of the Universal CRT, C++ Standard Library and Windows SDK.
  • LDLIBS defines the library import paths for the Universal CRT, C++ Standard Library and Windows SDK.
  • MSEXT contains the flags required to make clang act more like CL. These are not required anymore; Visual Studio 2017 works without them.

The rest of the file is dedicated to compiling all .cc files in the current directory and linking them into an executable.

This example makes use of lld, LLVM's linker. It has one caveat: it's not yet able to fully emit PDBs, so you might want to keep using LINK.EXE until lld is fully ready. You can use your normal linking process, since the output of clang is fully compatible.

Questions? @ArvidGerstmann on Twitter.