.. SPDX-License-Identifier: GPL-2.0

=====================
Introduction of mseal
=====================

:Author: Jeff Xu <jeffxu@chromium.org>

Modern CPUs support memory permissions such as the RW and NX bits. The memory
permission feature improves the security stance on memory corruption bugs, i.e.
the attacker can’t simply write to arbitrary memory and point the code to it;
the memory has to be marked with the X bit, or else an exception will happen.

Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages, and
applications can additionally seal security-critical data at runtime.

A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].

SYSCALL
=======
mseal syscall signature
-----------------------
   ``int mseal(void *addr, size_t len, unsigned long flags)``

   **addr**/**len**: virtual memory address range.
      The address range set by **addr**/**len** must meet:
         - The start address must be in an allocated VMA.
         - The start address must be page aligned.
         - The end address (**addr** + **len**) must be in an allocated VMA.
         - There must be no gap (unallocated memory) between the start and
           end addresses.

      The ``len`` will be page aligned implicitly by the kernel.

   **flags**: reserved for future use; must currently be 0.
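   Depending on the installed C library, an ``mseal()`` wrapper may not be
   available. The following is a minimal sketch of calling the syscall
   directly through ``syscall(2)``. The helper names ``is_page_aligned()``
   and ``seal_range()`` are hypothetical, and the fallback syscall number 462
   is an assumption that should be checked against the kernel headers of the
   target architecture.

   .. code-block:: c

      #define _GNU_SOURCE
      #include <assert.h>
      #include <errno.h>
      #include <stddef.h>
      #include <stdint.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      /* Assumption: 462 is the mseal syscall number on most architectures.
       * Prefer the value from the installed kernel headers when defined. */
      #ifndef SYS_mseal
      #define SYS_mseal 462
      #endif

      /* Return 1 if addr is a multiple of the system page size, else 0. */
      static int is_page_aligned(const void *addr)
      {
          long page = sysconf(_SC_PAGESIZE);

          assert(page > 0);
          return ((uintptr_t)addr % (uintptr_t)page) == 0;
      }

      /* Hypothetical helper: seal [addr, addr + len) and return 0 on
       * success or a negative errno value on failure. */
      static int seal_range(void *addr, size_t len)
      {
          if (!is_page_aligned(addr))
              return -EINVAL;   /* the kernel rejects an unaligned addr too */
          if (syscall(SYS_mseal, addr, len, 0UL) != 0)
              return -errno;
          return 0;
      }

   With this sketch, ``seal_range(ptr, 4096)`` returns 0 when the range was
   sealed, and a negative errno value otherwise (for instance ``-ENOSYS`` on
   a kernel that does not provide the syscall).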

   **Return values**:
      - **0**: Success.
      - **-EINVAL**:
         * Invalid input ``flags``.
         * The start address (``addr``) is not page aligned.
         * Address range (``addr`` + ``len``) overflow.
      - **-ENOMEM**:
         * The start address (``addr``) is not allocated.
         * The end address (``addr`` + ``len``) is not allocated.
         * A gap (unallocated memory) between start and end address.
      - **-EPERM**:
         * Sealing is supported only on 64-bit CPUs; 32-bit is not supported.

   **Note about error return**:
      - For the above error cases, users can expect the given memory range
        to be unmodified, i.e. no partial update.
      - There might be other internal errors/cases not listed here, e.g.
        error during merging/splitting VMAs, or the process reaching the maximum
        number of supported VMAs. In those cases, partial updates to the given
        memory range could happen. However, those cases should be rare.
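   As an illustration of the list above, a caller could translate the result
   into a diagnostic. This is only a sketch: ``mseal_error_hint()`` is a
   hypothetical helper, the strings paraphrase this document rather than any
   kernel message, and the ``-ENOSYS`` case is an assumption for kernels
   that lack the syscall.

   .. code-block:: c

      #include <errno.h>

      /*
       * Describe an mseal result given as 0 or a negative errno value.
       * The text follows the return value list in this document.
       */
      static const char *mseal_error_hint(int rc)
      {
          switch (rc) {
          case 0:
              return "success";
          case -EINVAL:
              return "invalid flags, unaligned addr, or addr + len overflow";
          case -ENOMEM:
              return "start, end, or part of the range is not allocated";
          case -EPERM:
              return "sealing is not supported on 32-bit CPUs";
          case -ENOSYS:
              /* Assumption: not listed above; kernel without mseal. */
              return "the running kernel has no mseal syscall";
          default:
              return "unlisted internal error; a partial update is possible";
          }
      }

   The hint is keyed on the error code alone, so it cannot tell a listed
   cause apart from a rare internal failure that reports the same code.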

   **Architecture support**:
      mseal only works on 64-bit CPUs, not 32-bit CPUs.

   **Idempotent**:
      Users can call mseal multiple times. Calling mseal on already sealed
      memory is a no-op (not an error).

   **No munseal**:
      Once a mapping is sealed, it can't be unsealed. The kernel should never
      have munseal; this is consistent with other sealing features, e.g.
      F_SEAL_SEAL for files.

Blocked mm syscall for sealed mapping
-------------------------------------
   It might be important to note: **once the mapping is sealed, it will
   stay in the process's memory until the process terminates**.

   Example::

         void *ptr = mmap(NULL, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
         rc = mseal(ptr, 4096, 0);
         /* munmap will fail */
         rc = munmap(ptr, 4096);
         assert(rc < 0);

   Blocked mm syscall:
      - munmap
      - mmap
      - mremap
      - mprotect and pkey_mprotect
      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
        MADV_DONTNEED_LOCKED, MADV_DONTFORK, MADV_WIPEONFORK

   The first set of syscalls to block are munmap, mremap, and mmap. They can
   either leave an empty space in the address space, therefore allowing
   replacement with a new mapping with a new set of attributes, or can
   overwrite the existing mapping with another mapping.

   mprotect and pkey_mprotect are blocked because they change the
   protection bits (RWX) of the mapping.

   Certain destructive madvise behaviors, specifically MADV_DONTNEED,
   MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce
   risks when applied to anonymous memory by threads lacking write
   permissions. Consequently, these operations are prohibited under such
   conditions. The aforementioned behaviors have the potential to modify
   region contents by discarding pages, effectively performing a memset(0)
   operation on the anonymous memory.

   The kernel returns -EPERM for blocked syscalls.

   When a blocked syscall returns -EPERM due to sealing, the memory regions
   may or may not have been changed, depending on the syscall being blocked:

      - munmap: munmap is atomic. If one of the VMAs in the given range is
        sealed, none of the VMAs are updated.
      - mprotect, pkey_mprotect, madvise: a partial update might happen, e.g.
        when mprotect covers multiple VMAs, it might update the beginning
        VMAs before reaching the sealed VMA and returning -EPERM.
      - mmap and mremap: undefined behavior.
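   The all-or-nothing behavior described for munmap can be checked with a
   small experiment. The sketch below maps two pages, seals only the second
   one, and then tries to unmap both. The function name is hypothetical, the
   fallback syscall number 462 is an assumption, and the function reports 0
   instead of failing when sealing is not available.

   .. code-block:: c

      #define _GNU_SOURCE
      #include <errno.h>
      #include <sys/mman.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      #ifndef SYS_mseal
      #define SYS_mseal 462   /* assumption: value on most architectures */
      #endif

      /*
       * Returns 1 if munmap over a partly sealed range failed with EPERM
       * and left the unsealed page mapped, 0 if the range could not be
       * sealed here, and -1 on any other outcome.
       */
      static int munmap_is_all_or_nothing(void)
      {
          size_t page = (size_t)sysconf(_SC_PAGESIZE);
          unsigned char vec;
          int ok;
          char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
              return -1;
          /* Seal only the second page. */
          if (syscall(SYS_mseal, p + page, page, 0UL) != 0) {
              munmap(p, 2 * page);
              return 0;
          }
          /* The range includes a sealed VMA, so munmap must fail as a whole. */
          if (munmap(p, 2 * page) == 0 || errno != EPERM)
              return -1;
          /* mincore() fails with ENOMEM if the first page had been unmapped. */
          ok = mincore(p, page, &vec) == 0;
          munmap(p, page);   /* the unsealed page can still be released */
          return ok ? 1 : -1;
      }

   Note that the sealed page stays mapped after the function returns, which
   is exactly the lifetime property described at the start of this section.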

Use cases
=========
- glibc:
  The dynamic linker, while loading ELF executables, can apply sealing to
  mapping segments.

- Chrome browser: protect some security-sensitive data structures.

- System mappings:
  The system mappings are created by the kernel and include vdso, vvar,
  vvar_vclock, vectors (arm compat-mode), sigpage (arm compat-mode), uprobes.

  Those system mappings are read-only or execute-only; memory sealing can
  protect them from ever being changed to writable, or from being unmapped
  or remapped with different attributes. This is useful to mitigate memory
  corruption issues where a corrupted pointer is passed to a memory
  management system.

  If supported by an architecture (CONFIG_ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS),
  CONFIG_MSEAL_SYSTEM_MAPPINGS seals all system mappings of this
  architecture.

  The following architectures currently support this feature: x86-64, arm64,
  loongarch and s390.

  WARNING: This feature breaks programs which rely on relocating
  or unmapping system mappings. Known broken software at the time
  of writing includes CHECKPOINT_RESTORE, UML, gVisor, rr. Therefore
  this config can't be enabled universally.
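Whether a system mapping such as ``[vdso]`` ended up sealed can be inspected
from userspace. The sketch below assumes that the kernel reports sealed VMAs
with an ``sl`` entry in the ``VmFlags`` line of ``/proc/self/smaps``; both
helper names are hypothetical.

.. code-block:: c

   #include <stdio.h>
   #include <string.h>

   /* Return 1 if a "VmFlags:" line from smaps carries the "sl" flag. */
   static int vmflags_has_sealed(const char *line)
   {
       const char *p;

       if (strncmp(line, "VmFlags:", 8) != 0)
           return 0;
       /* Accept "sl" only as a whole space-delimited token. */
       for (p = line + 8; (p = strstr(p, "sl")) != NULL; p += 2) {
           if (p[-1] == ' ' &&
               (p[2] == ' ' || p[2] == '\n' || p[2] == '\0'))
               return 1;
       }
       return 0;
   }

   /*
    * Return 1 if the mapping whose smaps header contains name (for
    * example "[vdso]") is sealed, 0 if it is not, and -1 if the mapping
    * was not found or smaps could not be read.
    */
   static int mapping_is_sealed(const char *name)
   {
       char line[512];
       unsigned long start, end;
       int in_target = 0, result = -1;
       FILE *f = fopen("/proc/self/smaps", "r");

       if (f == NULL)
           return -1;
       while (fgets(line, sizeof(line), f) != NULL) {
           if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
               /* A "start-end ..." line opens a new VMA record. */
               in_target = strstr(line, name) != NULL;
           } else if (in_target && strncmp(line, "VmFlags:", 8) == 0) {
               result = vmflags_has_sealed(line);
               break;
           }
       }
       fclose(f);
       return result;
   }

Calling ``mapping_is_sealed("[vdso]")`` on a kernel built with
CONFIG_MSEAL_SYSTEM_MAPPINGS would be expected to report a sealed mapping,
provided the flag name assumed above is correct.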

When not to use mseal
=====================
Applications can apply sealing to any virtual memory region from userspace,
but it is *crucial to thoroughly analyze the mapping's lifetime* before
applying the seal. This is because the sealed mapping *won’t be unmapped*
until the process terminates or the exec system call is invoked.

For example:
   - aio/shm:
     aio/shm can call mmap and munmap on behalf of userspace, e.g.
     ksys_shmdt() in shm.c. The lifetimes of those mappings are not tied to
     the lifetime of the process. If those memories are sealed from userspace,
     then munmap will fail, causing leaks in VMA address space during the
     lifetime of the process.

   - ptr allocated by malloc (heap):
     Don't use mseal on a memory ptr returned from malloc().
     malloc() is implemented by an allocator, e.g. by glibc. The heap manager
     might allocate a ptr from brk or from a mapping created by mmap.
     If an app calls mseal on a ptr returned from malloc(), this can affect
     the heap manager's ability to manage the mappings; the outcome is
     non-deterministic.

     Example::

        ptr = malloc(size);
        /* don't call mseal on a ptr returned from malloc. */
        mseal(ptr, size, 0);
        /* free() will succeed, but the allocator can't shrink the heap below ptr */
        free(ptr);
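One way to avoid the malloc() problem, sketched here under the assumption
that the data can live in a mapping of its own, is to create a dedicated
anonymous mapping, fill it, drop write permission, and seal it last. The
helper name ``publish_sealed_copy()`` is hypothetical and the fallback
syscall number 462 is an assumption.

.. code-block:: c

   #define _GNU_SOURCE
   #include <string.h>
   #include <sys/mman.h>
   #include <sys/syscall.h>
   #include <unistd.h>

   #ifndef SYS_mseal
   #define SYS_mseal 462   /* assumption: value on most architectures */
   #endif

   /*
    * Copy len bytes into a private mapping that the allocator does not
    * manage, make it read-only, then try to seal it. Returns the mapping
    * or NULL. If sealed is not NULL, *sealed is set to 1 only when mseal
    * succeeded.
    */
   static void *publish_sealed_copy(const void *src, size_t len, int *sealed)
   {
       size_t page = (size_t)sysconf(_SC_PAGESIZE);
       size_t maplen = (len + page - 1) / page * page;
       int ok;
       void *p;

       if (len == 0)
           return NULL;
       p = mmap(NULL, maplen, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       if (p == MAP_FAILED)
           return NULL;
       memcpy(p, src, len);
       if (mprotect(p, maplen, PROT_READ) != 0) {
           munmap(p, maplen);
           return NULL;
       }
       /* Seal last: afterwards the mapping stays until exit or exec. */
       ok = syscall(SYS_mseal, p, maplen, 0UL) == 0;
       if (sealed != NULL)
           *sealed = ok;
       return p;
   }

Because the mapping is never handed to free(), the heap manager is not
affected by the seal. The cost is that each sealed object occupies at least
one page for the rest of the process lifetime.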

mseal doesn't block
===================
In a nutshell, mseal blocks certain mm syscalls from modifying some of a
VMA's attributes, such as protection bits (RWX). Sealing a mapping doesn't
mean the memory is immutable.

As Jann Horn pointed out in [3], there are still a few ways to write
to RO memory, which is, in a way, by design. Those could be blocked
by different security measures.

Those cases are:

   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
   - userfaultfd.
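The first case in the list can be tried out with a short experiment. The
sketch below maps a read-only page, seals it on a best-effort basis, and
then writes one byte through ``/proc/self/mem``. The function name is
hypothetical, the fallback syscall number 462 is an assumption, and the
outcome depends on whether the running system permits such writes.

.. code-block:: c

   #define _GNU_SOURCE
   #include <fcntl.h>
   #include <stdint.h>
   #include <sys/mman.h>
   #include <sys/syscall.h>
   #include <unistd.h>

   #ifndef SYS_mseal
   #define SYS_mseal 462   /* assumption: value on most architectures */
   #endif

   /*
    * Try to change a byte of a read-only private mapping through
    * /proc/self/mem. Returns 1 if the byte changed, 0 if the write was
    * refused, and -1 if the setup failed.
    */
   static int poke_readonly_via_proc_mem(void)
   {
       size_t page = (size_t)sysconf(_SC_PAGESIZE);
       char val = 'X';
       volatile char *p;
       int fd, changed;

       p = mmap(NULL, page, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       if (p == MAP_FAILED)
           return -1;
       /* Best effort: the check below also runs when sealing fails. */
       (void)syscall(SYS_mseal, p, page, 0UL);
       fd = open("/proc/self/mem", O_RDWR);
       if (fd < 0)
           return -1;
       changed = pwrite(fd, &val, 1, (off_t)(uintptr_t)p) == 1 &&
                 p[0] == 'X';
       close(fd);
       munmap((void *)p, page);   /* fails with EPERM if the page is sealed */
       return changed;
   }

A result of 1 shows that the page content changed even though the mapping
was never writable; a result of 0 or -1 indicates that the interface is
restricted or unavailable on the system at hand.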

The idea that inspired this patch comes from Stephen Röttger’s work in V8
CFI [4]. Chrome browser in ChromeOS will be the first user of this API.

Reference
=========
- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
- [2] https://man.openbsd.org/mimmutable.2
- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc