.. SPDX-License-Identifier: GPL-2.0

=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process.  Seccomp
falls short on this task, since it has limited support for efficiently
filtering syscalls based on memory regions, and it doesn't support removing
filters.  Therefore, a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace.  The application is in control of a flip
switch, indicating the current personality of the process.  A
multiple-personality application can then flip the switch without
invoking the kernel, when crossing the compatibility layer API
boundaries, to enable/disable the syscall redirection and execute
syscalls directly (disabled) or send them to be emulated in userspace
through a SIGSYS.

The goal of this design is to provide very quick compatibility layer
boundary crossings, which is achieved by not executing a syscall to change
personality every time the compatibility layer executes.  Instead, a
userspace memory region exposed to the kernel indicates the current
personality, and the application simply modifies its value to
configure the mechanism.

There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux.  Syscall User Dispatch therefore
doesn't rely on any part of the syscall ABI to do the filtering.  It uses
only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_EXCLUSIVE_ON or PR_SYS_DISPATCH_INCLUSIVE_ON,
to enable the mechanism globally for that thread, or PR_SYS_DISPATCH_OFF,
to disable it.  When PR_SYS_DISPATCH_OFF is used, the other arguments must
be zero.
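
As a minimal sketch of the call above, the following C helpers enable the
exclusive mode and switch the mechanism off again.  The helper names
sud_enable_exclusive and sud_disable are hypothetical, and the fallback
values are assumptions for userspace headers that predate these constants;
the definitions from <linux/prctl.h> should be preferred when available.

.. code-block:: c

    #include <stddef.h>
    #include <sys/prctl.h>

    /* Fallbacks for older headers; assumed to match <linux/prctl.h>. */
    #ifndef PR_SET_SYSCALL_USER_DISPATCH
    # define PR_SET_SYSCALL_USER_DISPATCH 59
    #endif
    #ifndef PR_SYS_DISPATCH_OFF
    # define PR_SYS_DISPATCH_OFF 0
    #endif
    #ifndef PR_SYS_DISPATCH_EXCLUSIVE_ON
    # define PR_SYS_DISPATCH_EXCLUSIVE_ON 1
    #endif

    /*
     * Hypothetical helper: enable dispatch for the calling thread, exempting
     * syscalls issued from [start, start + len).  Returns 0 on success or -1
     * with errno set, as prctl() does.
     */
    static inline int sud_enable_exclusive(const void *start, size_t len,
                                           volatile char *selector)
    {
        return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_EXCLUSIVE_ON,
                     (unsigned long)start, (unsigned long)len,
                     (unsigned long)selector);
    }

    /* Hypothetical helper: disable dispatch; all other arguments are zero. */
    static inline int sud_disable(void)
    {
        return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
                     0UL, 0UL, 0UL);
    }

On a kernel without Syscall User Dispatch support, both helpers are expected
to fail and set errno, so callers should check the return value.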

For PR_SYS_DISPATCH_EXCLUSIVE_ON, [<offset>, <offset>+<length>) delimits
a memory region from which syscalls are always executed directly,
regardless of the userspace selector.  This provides a fast path for the
C library, which includes the most common syscall dispatchers in native
code applications, and also provides a way for the signal handler to return
without triggering a nested SIGSYS on (rt\_)sigreturn.  Users of this
interface should make sure that at least the signal trampoline code is
included in this region.  In addition, on architectures that implement the
signal trampoline in the vDSO, that trampoline is never intercepted.
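
The handler side can be sketched as follows.  This is only an illustration,
assuming the kernel reports a dispatched syscall as a SIGSYS whose si_code is
SYS_USER_DISPATCH and whose si_syscall field carries the syscall number.  The
names sigsys_handler, install_sigsys_handler and last_dispatched_syscall are
hypothetical, and the fallback value for SYS_USER_DISPATCH is an assumption
for older C libraries.

.. code-block:: c

    #include <signal.h>
    #include <string.h>

    /* Fallback for older C libraries; assumed to match the kernel value. */
    #ifndef SYS_USER_DISPATCH
    # define SYS_USER_DISPATCH 2
    #endif

    /* Hypothetical bookkeeping: number of the last intercepted syscall. */
    static volatile sig_atomic_t last_dispatched_syscall = -1;

    static void sigsys_handler(int sig, siginfo_t *info, void *ucontext)
    {
        (void)sig;
        (void)ucontext;

        /* Ignore SIGSYS from other sources, e.g. kill() or seccomp. */
        if (info->si_code != SYS_USER_DISPATCH)
            return;

        /*
         * An emulator would service the foreign syscall here.  Syscalls
         * issued by the handler itself, and the final sigreturn, are only
         * executed directly if they come from the exempt region (or the
         * vDSO trampoline) or if the selector currently allows them.
         */
        last_dispatched_syscall = info->si_syscall;
    }

    /* Install the handler; returns 0 on success, -1 on error. */
    static int install_sigsys_handler(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sigsys_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        return sigaction(SIGSYS, &sa, NULL);
    }

The handler would normally be installed before the mechanism is enabled, so
that the first dispatched syscall already has somewhere to go.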

For PR_SYS_DISPATCH_INCLUSIVE_ON, [<offset>, <offset>+<length>) delimits
a memory region from which syscalls are dispatched based on
the userspace selector.  Syscalls from outside of the range are always
executed directly.
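
The two modes can be summarised with a small userspace model of the
decision.  This is a sketch of the rules as described in this document, not
kernel code: the enum names and the function sud_model are hypothetical, the
selector values 0 and 1 are assumed to correspond to
SYSCALL_DISPATCH_FILTER_ALLOW and SYSCALL_DISPATCH_FILTER_BLOCK, and the
vDSO trampoline exemption is left out.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical names for this model only; not kernel constants. */
    enum sud_mode { SUD_EXCLUSIVE, SUD_INCLUSIVE };
    enum sud_outcome { SUD_EXECUTE, SUD_DISPATCH, SUD_KILL };

    /* Half-open interval test: offset <= ip < offset + len. */
    static bool ip_in_range(uintptr_t ip, uintptr_t offset, uintptr_t len)
    {
        /* Unsigned wraparound makes ip < offset compare as "outside". */
        return ip - offset < len;
    }

    static enum sud_outcome sud_model(enum sud_mode mode, uintptr_t ip,
                                      uintptr_t offset, uintptr_t len,
                                      unsigned char selector)
    {
        bool inside = ip_in_range(ip, offset, len);

        /* Exclusive: the range is exempt.  Inclusive: all else is exempt. */
        bool exempt = (mode == SUD_EXCLUSIVE) ? inside : !inside;

        if (exempt)
            return SUD_EXECUTE;
        if (selector == 0)      /* assumed SYSCALL_DISPATCH_FILTER_ALLOW */
            return SUD_EXECUTE;
        if (selector == 1)      /* assumed SYSCALL_DISPATCH_FILTER_BLOCK */
            return SUD_DISPATCH;
        return SUD_KILL;        /* any other selector value */
    }

For example, with offset 0x1000 and length 0x100, a syscall issued from
0x10ff falls inside the range while one issued from 0x1100 does not.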

[selector] is a pointer to a char-sized location in the process memory
that provides a quick way to enable or disable syscall redirection
thread-wide, without the need to invoke the kernel directly.  The selector
can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK.
Any other value terminates the program with a SIGSYS.
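
Flipping the selector is then an ordinary store to memory.  The sketch below
assumes one selector byte per thread, whose address would be passed as
[selector] by that thread.  The names sud_selector, enter_emulated_code and
leave_emulated_code are hypothetical, and the fallback values are
assumptions for headers that lack the constants.

.. code-block:: c

    #include <sys/prctl.h>

    /* Fallbacks for older headers; assumed to match <linux/prctl.h>. */
    #ifndef SYSCALL_DISPATCH_FILTER_ALLOW
    # define SYSCALL_DISPATCH_FILTER_ALLOW 0
    #endif
    #ifndef SYSCALL_DISPATCH_FILTER_BLOCK
    # define SYSCALL_DISPATCH_FILTER_BLOCK 1
    #endif

    /* Hypothetical per-thread selector byte, read by the kernel on syscalls. */
    static _Thread_local volatile char sud_selector =
        SYSCALL_DISPATCH_FILTER_ALLOW;

    /* Called when crossing into the non-native part of the process. */
    static inline void enter_emulated_code(void)
    {
        sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK;
    }

    /* Called when returning to native code. */
    static inline void leave_emulated_code(void)
    {
        sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
    }

Neither helper enters the kernel, which is what keeps boundary crossings
cheap.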

Additionally, a task's syscall user dispatch configuration can be peeked
and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
requests.  This is useful for checkpoint/restart software.
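
A tracer might read the configuration roughly as shown below.  This sketch
assumes that the request takes the size of struct ptrace_sud_config in the
addr argument and a pointer to the structure in the data argument, and that
the tracee is already attached and stopped.  The helper name sud_peek_config
is hypothetical, and the fallback request number and structure layout are
assumptions for headers that lack them.

.. code-block:: c

    #include <stdint.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <linux/ptrace.h>

    /* Fallback for older headers; assumed to match <linux/ptrace.h>. */
    #ifndef PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG
    # define PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG 0x4210
    struct ptrace_sud_config {
        uint64_t mode;      /* one of the PR_SYS_DISPATCH_* values */
        uint64_t selector;  /* address of the selector in the tracee */
        uint64_t offset;    /* start of the configured region */
        uint64_t len;       /* length of the configured region */
    };
    #endif

    /*
     * Hypothetical helper: fetch the configuration of a stopped tracee.
     * Returns 0 on success or -1 with errno set.  The raw syscall is used
     * to avoid mixing <sys/ptrace.h> with <linux/ptrace.h>.
     */
    static long sud_peek_config(pid_t pid, struct ptrace_sud_config *cfg)
    {
        return syscall(SYS_ptrace, PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG,
                       pid, (void *)sizeof(*cfg), cfg);
    }

Calling the helper on a task that is not being traced by the caller is
expected to fail.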

Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process.  It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value.  If the use case requires any
kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.