.. SPDX-License-Identifier: GPL-2.0

======================================================
Control-flow Enforcement Technology (CET) Shadow Stack
======================================================

CET Background
==============

Control-flow Enforcement Technology (CET) covers several related x86 processor
features that provide protection against control flow hijacking attacks. CET
can protect both applications and the kernel.

CET introduces shadow stack and indirect branch tracking (IBT). A shadow stack
is a secondary stack allocated from memory which cannot be directly modified by
applications. When executing a CALL instruction, the processor pushes the
return address to both the normal stack and the shadow stack. Upon
function return, the processor pops the shadow stack copy and compares it
to the normal stack copy. If the two differ, the processor raises a
control-protection fault. IBT verifies that indirect CALL/JMP targets are
intended targets, as marked by the compiler with 'ENDBR' opcodes. Not all CPUs
support both shadow stack and indirect branch tracking. Today, the 64-bit
kernel supports only userspace shadow stack and kernel IBT.
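
The CALL/RET check described above can be pictured with a small software model. The C sketch below is only an illustration under simplifying assumptions: a fixed-depth array stands in for each stack, and ss_call() and ss_ret() are hypothetical names. On real hardware the CPU maintains the shadow stack and raises the control-protection fault itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Software model only: real shadow stacks are maintained by the CPU. */
#define MODEL_DEPTH 64

struct model_stack {
    uint64_t normal[MODEL_DEPTH];   /* writable by the modelled program */
    uint64_t shadow[MODEL_DEPTH];   /* only touched by ss_call()/ss_ret() */
    size_t depth;
};

/* CALL: push the return address to both the normal and shadow stack. */
static bool ss_call(struct model_stack *s, uint64_t ret_addr)
{
    if (s->depth == MODEL_DEPTH)
        return false;               /* model overflow, not a CET rule */
    s->normal[s->depth] = ret_addr;
    s->shadow[s->depth] = ret_addr;
    s->depth++;
    return true;
}

/*
 * RET: pop both copies and compare them. Returns false where the CPU
 * would raise a control-protection fault; *target receives the address
 * the normal stack would have returned to.
 */
static bool ss_ret(struct model_stack *s, uint64_t *target)
{
    if (s->depth == 0)
        return false;
    s->depth--;
    *target = s->normal[s->depth];
    return s->shadow[s->depth] == *target;
}
```

Overwriting an entry in normal[] models a return-address overwrite: ss_ret() refuses the return because the shadow copy, which the model treats as off-limits to the program, still holds the original address.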

Requirements to use Shadow Stack
================================

To use userspace shadow stack you need HW that supports it, a kernel
configured with it, and userspace libraries compiled with it.

The kernel Kconfig option is X86_USER_SHADOW_STACK. When compiled in, shadow
stacks can be disabled at boot with the kernel parameter nousershstk.

To build a kernel with user shadow stack support, Binutils v2.29 or later, or
LLVM v6 or later, is required.

At run time, /proc/cpuinfo shows CET features if the processor supports
CET. The "user_shstk" flag means that userspace shadow stack is supported by
both the current kernel and the HW.
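
A program can perform the same /proc/cpuinfo check itself. This is a minimal sketch that looks for user_shstk as a whole word on a "flags" line; has_flag() and cpu_has_user_shstk() are hypothetical helper names, and the sketch assumes each flags line fits in a 4 KiB buffer.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if 'flag' occurs in 'flags' as a whole space/tab separated word. */
static bool has_flag(const char *flags, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = flags;

    if (n == 0)
        return false;
    while ((p = strstr(p, flag)) != NULL) {
        bool start_ok = p == flags || p[-1] == ' ' || p[-1] == '\t';
        bool end_ok = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' ||
                      p[n] == '\n';

        if (start_ok && end_ok)
            return true;
        p += n;                     /* partial match, keep searching */
    }
    return false;
}

/* Scan the "flags" lines of /proc/cpuinfo for user_shstk. */
static bool cpu_has_user_shstk(void)
{
    char line[4096];                /* assumed to hold a full flags line */
    bool found = false;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (f == NULL)
        return false;
    while (!found && fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, "flags", 5) == 0)
            found = has_flag(line, "user_shstk");
    }
    fclose(f);
    return found;
}
```

Matching whole words matters because flag names can share prefixes, so a plain substring search could report a false positive.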

Application Enabling
====================

An application's CET capability is marked in its ELF note and can be verified
from readelf/llvm-readelf output::

    readelf -n <application> | grep -a SHSTK
        properties: x86 feature: SHSTK

The kernel does not process these application markers directly. Applications
or loaders must enable CET features using the interface described in the
"Enabling arch_prctl()'s" section below. Typically this is done in the dynamic
loader or in static runtime objects, as is the case in GLIBC.
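
Since the kernel leaves the marker to userspace, a loader has to find it itself. The sketch below assumes the loader has already located the descriptor of the NT_GNU_PROPERTY_TYPE_0 note (for example through the PT_GNU_PROPERTY program header) and that the file is a 64-bit ELF in host byte order. It walks the property list and tests the SHSTK bit of GNU_PROPERTY_X86_FEATURE_1_AND. The function name note_wants_shstk() is hypothetical; the constant values match glibc's <elf.h>.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same values as glibc's <elf.h>; guarded in case that header is used. */
#ifndef GNU_PROPERTY_X86_FEATURE_1_AND
#define GNU_PROPERTY_X86_FEATURE_1_AND   0xc0000002U
#endif
#ifndef GNU_PROPERTY_X86_FEATURE_1_SHSTK
#define GNU_PROPERTY_X86_FEATURE_1_SHSTK (1U << 1)
#endif

/*
 * Walk the descriptor of an NT_GNU_PROPERTY_TYPE_0 note from a 64-bit ELF
 * file. Each property is a u32 pr_type, a u32 pr_datasz and pr_datasz bytes
 * of data padded to 8 bytes. Returns true if the X86_FEATURE_1_AND property
 * has the SHSTK bit set.
 */
static bool note_wants_shstk(const unsigned char *desc, size_t len)
{
    size_t off = 0;

    while (off + 8 <= len) {
        uint32_t type, datasz, features;

        /* memcpy avoids unaligned loads from the raw buffer */
        memcpy(&type, desc + off, sizeof(type));
        memcpy(&datasz, desc + off + 4, sizeof(datasz));
        off += 8;
        if (datasz > len - off)
            return false;                       /* truncated property */
        if (type == GNU_PROPERTY_X86_FEATURE_1_AND && datasz >= 4) {
            memcpy(&features, desc + off, sizeof(features));
            return (features & GNU_PROPERTY_X86_FEATURE_1_SHSTK) != 0;
        }
        off += ((size_t)datasz + 7) & ~(size_t)7; /* data plus padding */
    }
    return false;
}
```

A real loader typically has to take every object loaded into the process into account, not just the main executable, before deciding to enable shadow stack.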

Enabling arch_prctl()'s
=======================

ELF features should be enabled by the loader using the arch_prctl() calls
below. They are only supported in 64-bit user applications. They operate on
features on a per-thread basis. The enablement status is inherited on clone,
so if a feature is enabled on the first thread, it propagates to all the
threads in an application.

arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
    Enable a single feature specified in 'feature'. Can only operate on
    one feature at a time.

arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
    Disable a single feature specified in 'feature'. Can only operate on
    one feature at a time.

arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
    Lock in features at their current enabled or disabled status. 'features'
    is a mask of all features to lock. All set bits are processed; unset bits
    are ignored. The mask is ORed with the existing value, so any feature bits
    set here cannot be enabled or disabled afterwards.

arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features)
    Unlock features. 'features' is a mask of all features to unlock. All
    set bits are processed; unset bits are ignored. Only works via ptrace.

arch_prctl(ARCH_SHSTK_STATUS, unsigned long addr)
    Copy the currently enabled features to the address passed in addr. The
    features are described using the same bits that the other calls take in
    'features'.

The return values are as follows. On success, 0 is returned. On error, the
returned error can be::

        -EPERM if any of the passed features are locked.
        -ENOTSUPP if the feature is not supported by the hardware or
         kernel.
        -EINVAL for invalid arguments (non-existing feature, etc.)
        -EFAULT if the information could not be copied back to userspace
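
Of these calls, ARCH_SHSTK_STATUS is the easiest to try from ordinary C because it does not change the thread's shadow stack state. The sketch below issues it through the libc syscall() wrapper and returns the failure as a negative errno value. shstk_status() is a hypothetical helper, and the option value is taken from the x86 uapi <asm/prctl.h> header in case the installed headers predate it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Option value from the x86 uapi <asm/prctl.h>; defined here in case the
 * installed headers predate it. */
#ifndef ARCH_SHSTK_STATUS
#define ARCH_SHSTK_STATUS 0x5005
#endif

/*
 * Hypothetical helper: store the calling thread's enabled shadow stack
 * features in *features. Returns 0 on success or a negative errno value,
 * typically -EINVAL on kernels without shadow stack support.
 */
static int shstk_status(unsigned long *features)
{
#if defined(__x86_64__) && defined(SYS_arch_prctl)
    if (syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, features) == 0)
        return 0;
    return -errno;
#else
    (void)features;
    return -ENOSYS;                 /* arch_prctl() is x86 specific */
#endif
}
```

This sketch covers only ARCH_SHSTK_STATUS. Enabling shadow stack through a wrapper function like this is not equivalent: the wrapper's own return address was never pushed to the new shadow stack, so its RET would be expected to fault, which is why such calls are usually issued from inlined code in the loader.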

The supported feature bits are::

    ARCH_SHSTK_SHSTK - Shadow stack
    ARCH_SHSTK_WRSS  - WRSS

Currently shadow stack and WRSS are supported via this interface. WRSS (the
instruction that lets userspace write to its own shadow stack) can only be
enabled while shadow stack is enabled, and is automatically disabled if shadow
stack is disabled.
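
The enable, disable and lock rules above interact, so a small state model can help when reasoning about them. The following is a hypothetical userspace model, not the kernel's implementation: it tracks enabled and locked masks for one thread, rejects locked features with -EPERM and multi-bit or unknown masks with -EINVAL, requires shadow stack before WRSS, and drops WRSS when shadow stack is disabled. The error code returned for enabling WRSS without shadow stack is an assumption of the model.

```c
#include <errno.h>
#include <stdbool.h>

/* Feature bits with the values used by <asm/prctl.h>. */
#ifndef ARCH_SHSTK_SHSTK
#define ARCH_SHSTK_SHSTK (1ULL << 0)
#define ARCH_SHSTK_WRSS  (1ULL << 1)
#endif
#define MODEL_KNOWN (ARCH_SHSTK_SHSTK | ARCH_SHSTK_WRSS)

/* Hypothetical per-thread state; not the kernel's data structure. */
struct shstk_model {
    unsigned long long enabled;
    unsigned long long locked;
};

/* Exactly one known bit: the interface handles one feature per call. */
static bool one_known_feature(unsigned long long f)
{
    return f != 0 && (f & (f - 1)) == 0 && (f & ~MODEL_KNOWN) == 0;
}

static int model_enable(struct shstk_model *s, unsigned long long f)
{
    if (f & s->locked)
        return -EPERM;
    if (!one_known_feature(f))
        return -EINVAL;
    /* WRSS needs shadow stack; this error code is the model's choice. */
    if (f == ARCH_SHSTK_WRSS && !(s->enabled & ARCH_SHSTK_SHSTK))
        return -EPERM;
    s->enabled |= f;
    return 0;
}

static int model_disable(struct shstk_model *s, unsigned long long f)
{
    if (f & s->locked)
        return -EPERM;
    if (!one_known_feature(f))
        return -EINVAL;
    s->enabled &= ~f;
    if (f == ARCH_SHSTK_SHSTK)      /* WRSS goes away with shadow stack */
        s->enabled &= ~ARCH_SHSTK_WRSS;
    return 0;
}

/* LOCK: OR the mask into the locked set; unset bits are ignored. */
static void model_lock(struct shstk_model *s, unsigned long long mask)
{
    s->locked |= mask;
}
```

After model_lock() the locked bits can no longer change in either direction, which mirrors the point of locking: a loader can fix the configuration before handing control to application code.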

Proc Status
===========
To check if an application is actually running with shadow stack, the
user can read /proc/$PID/status. It reports "shstk", "wrss", or both,
depending on what is enabled, and lists locked features on a separate line.
The lines look like this::

    x86_Thread_features: shstk wrss
    x86_Thread_features_locked: shstk wrss
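
The same check can be made programmatically by parsing the status file. This sketch assumes the line format shown above (a label followed by space-separated feature names) and matches whole words, keying on the colon so that the separate x86_Thread_features_locked: line is not mistaken for the enabled set. status_has_feature() and self_has_shstk() are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * True if the "x86_Thread_features:" line of a /proc/<pid>/status dump
 * lists 'feature' (for example "shstk" or "wrss") as a whole word. The key
 * includes the colon, so "x86_Thread_features_locked:" is not matched.
 */
static bool status_has_feature(const char *status, const char *feature)
{
    static const char key[] = "x86_Thread_features:";
    const size_t klen = sizeof(key) - 1;
    const size_t flen = strlen(feature);
    const char *line = status;

    if (flen == 0)
        return false;
    while (line != NULL && *line != '\0') {
        const char *eol = strchr(line, '\n');
        const char *end = eol ? eol : line + strlen(line);

        if ((size_t)(end - line) >= klen && strncmp(line, key, klen) == 0) {
            const char *p = line + klen;

            while (p < end) {
                const char *w;

                while (p < end && (*p == ' ' || *p == '\t'))
                    p++;                /* skip separators */
                w = p;
                while (p < end && *p != ' ' && *p != '\t')
                    p++;                /* end of the current word */
                if ((size_t)(p - w) == flen && strncmp(w, feature, flen) == 0)
                    return true;
            }
            return false;
        }
        line = eol ? eol + 1 : NULL;
    }
    return false;
}

/* Convenience wrapper for the current process. */
static bool self_has_shstk(void)
{
    char buf[8192];                 /* assumed large enough for status */
    size_t n;
    FILE *f = fopen("/proc/self/status", "r");

    if (f == NULL)
        return false;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return status_has_feature(buf, "shstk");
}
```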

Implementation of the Shadow Stack
==================================

Shadow Stack Size
-----------------

A task's shadow stack is allocated from memory to a fixed size of
MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
the maximum size of the normal stack, but capped at 4 GB. For the clone3
syscall, a stack size is passed in and the shadow stack uses it instead of
the rlimit.
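
The sizing rule can be written out as a small function. This is a sketch of the rule as stated here only: model_shstk_size() and current_shstk_size() are hypothetical names, a clone3_stack_size of 0 stands for "no size passed", and any page rounding the kernel may apply is not modelled.

```c
#include <stdint.h>
#include <sys/resource.h>

#define SHSTK_SIZE_CAP (4ULL << 30)     /* 4 GB */

/* MIN(RLIMIT_STACK, 4 GB), or the clone3() stack size when one is given. */
static uint64_t model_shstk_size(uint64_t stack_rlimit,
                                 uint64_t clone3_stack_size)
{
    if (clone3_stack_size != 0)
        return clone3_stack_size;
    return stack_rlimit < SHSTK_SIZE_CAP ? stack_rlimit : SHSTK_SIZE_CAP;
}

/* Size the rule gives under the calling process's current soft limit. */
static uint64_t current_shstk_size(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return 0;
    /* RLIM_INFINITY is larger than the cap, so it maps to 4 GB. */
    return model_shstk_size((uint64_t)rl.rlim_cur, 0);
}
```

With the common 8 MiB stack limit this yields an 8 MiB shadow stack, while an unlimited RLIMIT_STACK is capped at 4 GB.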

Signal
------

The main program and its signal handlers use the same shadow stack. Because
the shadow stack stores only return addresses, which take far less space than
full stack frames, a large shadow stack covers the case where both the program
stack and the signal alternate stack run out.

When a signal is delivered, the old pre-signal state is pushed on the stack.
When shadow stack is enabled, the shadow stack specific state is pushed onto
the shadow stack. Today this is only the old SSP (shadow stack pointer),
pushed in a special format with bit 63 set. On sigreturn this old SSP token is
verified and restored by the kernel. The kernel also pushes the normal
restorer address to the shadow stack to help userspace avoid a shadow stack
violation on the sigreturn path that goes through the restorer.

So the shadow stack signal frame format is as follows::

    |1...old SSP| - Pointer to old pre-signal ssp in sigframe token format
                    (bit 63 set to 1)
    |        ...| - Other state may be added in the future
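
The token encoding in the frame above amounts to setting and checking bit 63. A minimal sketch with hypothetical helper names follows; the kernel's real sigreturn path may perform further validation that is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define SIGFRAME_TOKEN_BIT (1ULL << 63)

/* Encode the pre-signal SSP as a sigframe token: bit 63 set. */
static uint64_t ssp_to_sigframe_token(uint64_t old_ssp)
{
    return old_ssp | SIGFRAME_TOKEN_BIT;
}

/*
 * Decode on sigreturn. Returns false if bit 63 is clear, i.e. the value
 * is not a sigframe token; otherwise stores the old SSP.
 */
static bool sigframe_token_to_ssp(uint64_t token, uint64_t *old_ssp)
{
    if (!(token & SIGFRAME_TOKEN_BIT))
        return false;
    *old_ssp = token & ~SIGFRAME_TOKEN_BIT;
    return true;
}
```

Because userspace addresses on x86-64 have bit 63 clear, a value with that bit set cannot be mistaken for an ordinary return address pushed by CALL.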

32-bit ABI signals are not supported in shadow stack processes. Linux prevents
32-bit execution while shadow stack is enabled by allocating shadow stacks
outside of the 32-bit address space. When execution enters 32-bit mode, either
via a far call or by returning to userspace, the hardware generates a #GP,
which is delivered to the process as a segfault. When transitioning to
userspace, the register state will be as if the userspace IP being returned
to caused the segfault.

Fork
----

The shadow stack's vma has the VM_SHADOW_STACK flag set; its PTEs are required
to be read-only and dirty. When a shadow stack PTE is not read-only and dirty,
a shadow stack access triggers a page fault with the shadow stack access bit
set in the page fault error code.

When a task forks a child, its shadow stack PTEs are copied and both the
parent's and the child's shadow stack PTEs are cleared of the dirty bit.
Upon the next shadow stack access, the resulting shadow stack page fault
is handled by page copy/re-use.
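
The PTE rule and the fork behaviour can be summarised with the architectural x86 page table bits (R/W is bit 1, Dirty is bit 6). The sketch below is a simplified model with hypothetical helper names; it ignores the rest of the kernel's page table bookkeeping and only shows why a post-fork shadow stack access faults and how the fault handling restores the shadow stack encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 PTE bits involved in the shadow stack encoding. */
#define PTE_RW    (1ULL << 1)
#define PTE_DIRTY (1ULL << 6)

/* Shadow stack memory is encoded as read-only (RW=0) and dirty. */
static bool pte_is_shadow_stack(uint64_t pte)
{
    return !(pte & PTE_RW) && (pte & PTE_DIRTY);
}

/* Fork as described above: parent and child copies lose the dirty bit. */
static void model_fork_ptes(uint64_t *parent, uint64_t *child)
{
    *parent &= ~PTE_DIRTY;
    *child = *parent;
}

/* After the page is copied or re-used, restore the RO + dirty encoding. */
static uint64_t model_shstk_fault_fixup(uint64_t pte)
{
    return (pte & ~PTE_RW) | PTE_DIRTY;
}
```

Ordinary stores to a shadow stack page fault as well, because R/W is clear, which is what keeps applications from modifying the shadow stack directly.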

When a pthread child is created, the kernel allocates a new shadow stack
for the new thread. New shadow stack creation behaves like mmap() with respect
to ASLR. Similarly, on thread exit the thread's shadow stack is disabled.

Exec
----

On exec, shadow stack features are disabled by the kernel, at which point
userspace can choose to re-enable or lock them.