.. SPDX-License-Identifier: GPL-2.0

===========================
Hypercall Op-codes (hcalls)
===========================

Overview
========

Virtualization on 64-bit Power Book3S platforms is based on the PAPR
specification [1]_, which describes the run-time environment for a guest
operating system and how it should interact with the hypervisor for
privileged operations. Currently there are two PAPR-compliant hypervisors:

- **IBM PowerVM (PHYP)**: IBM's proprietary hypervisor that supports AIX,
  IBM-i and Linux as guests (termed Logical Partitions or LPARs). It
  supports the full PAPR specification.

- **QEMU/KVM**: Supports PPC64 Linux guests running on a PPC64 Linux host.
  It implements only a subset of the PAPR specification, called LoPAPR [2]_.

On the PPC64 architecture, a guest kernel running on top of a PAPR hypervisor
is called a *pSeries guest*. A pseries guest runs in supervisor mode (HV=0)
and must issue hypercalls to the hypervisor whenever it needs to perform an
action that is hypervisor privileged [3]_ or needs other services managed by
the hypervisor.

Hence a hypercall (hcall) is essentially a request by the pseries guest
asking the hypervisor to perform a privileged operation on its behalf. The
guest issues a hcall with the necessary input operands. After performing the
privileged operation, the hypervisor returns a status code and output
operands to the guest.

HCALL ABI
=========
The ABI specification for a hcall between a pseries guest and a PAPR hypervisor
is covered in section 14.5.3 of ref [2]_. The switch to the hypervisor context
is done via the instruction **HVSC**, which expects the opcode for the hcall to
be set in *r3* and any in-arguments for the hcall to be provided in registers
*r4-r12*. If values have to be passed through a memory buffer, the data stored
in that buffer should be in big-endian byte order.

Once control returns to the guest after the hypervisor has serviced the
**HVSC** instruction, the return value of the hcall is available in *r3* and
any out values are returned in registers *r4-r12*. As with in-arguments, any
out values stored in a memory buffer will be in big-endian byte order.
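
The big-endian buffer requirement can be illustrated with a minimal user-space
sketch. The helper names ``put_be64`` and ``get_be64`` are hypothetical and
only model the byte layout; kernel code would normally rely on the existing
``cpu_to_be64()`` and ``be64_to_cpu()`` conversions instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helpers (not kernel APIs): store and load a 64-bit value
 * in big-endian byte order, independent of the endianness of the CPU.
 */
static void put_be64(uint8_t *buf, uint64_t val)
{
    assert(buf != NULL);
    /* The most significant byte goes to the lowest address. */
    for (size_t i = 0; i < 8; i++)
        buf[i] = (uint8_t)(val >> (56 - 8 * i));
}

static uint64_t get_be64(const uint8_t *buf)
{
    uint64_t val = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < 8; i++)
        val = (val << 8) | buf[i];
    return val;
}
```

Storing 0x0102030405060708 this way places 0x01 in the first byte of the
buffer and 0x08 in the last, on both little-endian and big-endian guests.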

The powerpc arch code provides convenient wrappers named **plpar_hcall_xxx**,
defined in an arch-specific header [4]_, to issue hcalls from the Linux kernel
running as a pseries guest.

Register Conventions
====================

Any hcall should follow the same register conventions as described in section
2.2.1.1 of "64-Bit ELF V2 ABI Specification: Power Architecture" [5]_. The
table below summarizes these conventions:

+----------+----------+-------------------------------------------+
| Register |Volatile  |  Purpose                                  |
| Range    |(Y/N)     |                                           |
+==========+==========+===========================================+
|   r0     |    Y     |  Optional-usage                           |
+----------+----------+-------------------------------------------+
|   r1     |    N     |  Stack Pointer                            |
+----------+----------+-------------------------------------------+
|   r2     |    N     |  TOC                                      |
+----------+----------+-------------------------------------------+
|   r3     |    Y     |  hcall opcode/return value                |
+----------+----------+-------------------------------------------+
|  r4-r10  |    Y     |  in and out values                        |
+----------+----------+-------------------------------------------+
|   r11    |    Y     |  Optional-usage/Environmental pointer     |
+----------+----------+-------------------------------------------+
|   r12    |    Y     |  Optional-usage/Function entry address at |
|          |          |  global entry point                       |
+----------+----------+-------------------------------------------+
|   r13    |    N     |  Thread-Pointer                           |
+----------+----------+-------------------------------------------+
|  r14-r31 |    N     |  Local Variables                          |
+----------+----------+-------------------------------------------+
|    LR    |    Y     |  Link Register                            |
+----------+----------+-------------------------------------------+
|   CTR    |    Y     |  Loop Counter                             |
+----------+----------+-------------------------------------------+
|   XER    |    Y     |  Fixed-point exception register.          |
+----------+----------+-------------------------------------------+
|  CR0-1   |    Y     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
|  CR2-4   |    N     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
|  CR5-7   |    Y     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
|  Others  |    N     |                                           |
+----------+----------+-------------------------------------------+

DRC & DRC Indexes
=================
::

     DR1                                  Guest
     +--+        +------------+         +---------+
     |  | <----> |            |         |  User   |
     +--+  DRC1  |            |   DRC   |  Space  |
                 |    PAPR    |  Index  +---------+
     DR2         | Hypervisor |         |         |
     +--+        |            | <-----> |  Kernel |
     |  | <----> |            |  Hcall  |         |
     +--+  DRC2  +------------+         +---------+

The PAPR hypervisor refers to shared hardware resources, such as PCI devices
and NVDIMMs, that are available for use by LPARs as Dynamic Resources (DRs).
When a DR is allocated to an LPAR, PHYP creates a data structure called a
Dynamic Resource Connector (DRC) to manage LPAR access. An LPAR refers to a
DRC via an opaque 32-bit number called the DRC-Index. The DRC-Index value is
provided to the LPAR via the device tree, where it is present as an attribute
in the device tree node associated with the DR.

HCALL Return-values
===================

After servicing the hcall, the hypervisor sets the return value in *r3*,
indicating success or failure of the hcall. In case of a failure, an error
code indicates the cause of the error. These codes are defined and documented
in the arch-specific header [4]_.

In some cases a hcall can potentially take a long time and needs to be issued
multiple times in order to be completely serviced. These hcalls usually
accept an opaque value, *continue-token*, within their argument list, and a
return value of *H_CONTINUE* indicates that the hypervisor has not yet
finished servicing the hcall.

To make such hcalls, the guest needs to set *continue-token == 0* for the
initial call and use the hypervisor-returned value of *continue-token*
for each subsequent hcall until the hypervisor returns a return value other
than *H_CONTINUE*.
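
The continue-token protocol described above can be sketched in user-space C
as follows. This is only a model under stated assumptions: the callback type
``hcall_step_fn``, the driver ``hcall_run_to_completion`` and the toy
``fake_hcall_step`` "hypervisor" are hypothetical names, and the numeric
return codes are believed to match arch/powerpc/include/asm/hvcall.h but
should be checked against that header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values believed to match hvcall.h; verify against the header. */
#define H_SUCCESS   0
#define H_CONTINUE  18
#define H_PARAMETER (-4)

/*
 * Hypothetical callback standing in for one invocation of a long-running
 * hcall: it takes the current continue-token and stores the next one.
 */
typedef long (*hcall_step_fn)(void *ctx, uint64_t token_in,
                              uint64_t *token_out);

/* Reissue the hcall until it stops returning H_CONTINUE or max_calls is hit. */
static long hcall_run_to_completion(hcall_step_fn step, void *ctx,
                                    unsigned int max_calls)
{
    uint64_t token = 0; /* the initial call passes continue-token == 0 */
    long rc = H_CONTINUE;

    assert(step != NULL);
    for (unsigned int i = 0; i < max_calls && rc == H_CONTINUE; i++)
        rc = step(ctx, token, &token);
    return rc;
}

/* Toy stand-in for the hypervisor: finishes one unit of work per call. */
struct fake_job {
    unsigned int units_left;
    uint64_t next_token;
    unsigned int calls;
};

static long fake_hcall_step(void *ctx, uint64_t token_in, uint64_t *token_out)
{
    struct fake_job *job = ctx;

    job->calls++;
    if (token_in != job->next_token)
        return H_PARAMETER; /* the guest passed a stale token */
    if (job->units_left > 0)
        job->units_left--;
    if (job->units_left == 0)
        return H_SUCCESS;
    job->next_token += 0x1000; /* opaque to the guest */
    *token_out = job->next_token;
    return H_CONTINUE;
}
```

The guest never interprets the token; it only hands back whatever the previous
call returned. The ``max_calls`` bound is a design choice of this sketch so
that a misbehaving callee cannot make the loop spin forever.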

HCALL Op-codes
==============

Below is a partial list of hcalls that are supported by PHYP. For the
corresponding opcode values, see the arch-specific header [4]_:

**H_SCM_READ_METADATA**

| Input: *drcIndex, offset, buffer-address, numBytesToRead*
| Out: *numBytesRead*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_Hardware*

Given a DRC Index of an NVDIMM, read N bytes from the metadata area
associated with it, at a specified offset, and copy them to the provided
buffer. The metadata area stores configuration information such as label
information, bad blocks etc. The metadata area is located out-of-band of the
NVDIMM storage area, hence separate access semantics are provided.

**H_SCM_WRITE_METADATA**

| Input: *drcIndex, offset, data, numBytesToWrite*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_P2, H_P4, H_Hardware*

Given a DRC Index of an NVDIMM, write N bytes from the provided buffer to the
metadata area associated with it, at the specified offset.

**H_SCM_BIND_MEM**

| Input: *drcIndex, startingScmBlockIndex, numScmBlocksToBind,*
| *targetLogicalMemoryAddress, continue-token*
| Out: *continue-token, targetLogicalMemoryAddress, numScmBlocksToBound*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_P4, H_Overlap,*
| *H_Too_Big, H_P5, H_Busy*

Given a DRC-Index of an NVDIMM, map a contiguous range of SCM blocks
*(startingScmBlockIndex, startingScmBlockIndex+numScmBlocksToBind)* to the guest
at *targetLogicalMemoryAddress* within the guest physical address space. If
*targetLogicalMemoryAddress == 0xFFFFFFFF_FFFFFFFF*, the hypervisor
assigns a target address to the guest. The hcall can fail if the guest has
an active PTE entry to the SCM block being bound.
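
The two input conventions of H_SCM_BIND_MEM, the block range and the all-ones
target address, can be modelled with a short sketch. The names
``SCM_BIND_ANY_ADDR``, ``scm_bind_addr_is_hypervisor_chosen`` and
``scm_block_range_end`` are hypothetical, and treating the range as half-open
(the end index is excluded) is an assumption of this sketch rather than
something stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All-ones address: ask the hypervisor to pick the target address. */
#define SCM_BIND_ANY_ADDR UINT64_MAX /* 0xFFFFFFFF_FFFFFFFF */

static bool scm_bind_addr_is_hypervisor_chosen(uint64_t target_addr)
{
    return target_addr == SCM_BIND_ANY_ADDR;
}

/*
 * Compute the exclusive end of the block range, assuming the range is
 * half-open: [start, start + count). Returns false if the range is empty
 * or the end index would overflow 64 bits.
 */
static bool scm_block_range_end(uint64_t start, uint64_t count, uint64_t *end)
{
    assert(end != NULL);
    if (count == 0 || start > UINT64_MAX - count)
        return false;
    *end = start + count;
    return true;
}
```

Checking the range on the guest side before issuing the hcall is optional; the
hypervisor remains the authority and reports invalid arguments through its
return value.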

**H_SCM_UNBIND_MEM**

| Input: *drcIndex, startingScmLogicalMemoryAddress, numScmBlocksToUnbind*
| Out: *numScmBlocksUnbound*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Overlap,*
| *H_Busy, H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*

Given a DRC-Index of an NVDIMM, unmap *numScmBlocksToUnbind* SCM blocks starting
at *startingScmLogicalMemoryAddress* from the guest physical address space. The
hcall can fail if the guest has an active PTE entry to the SCM block being
unbound.
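
The busy return codes listed for this hcall (H_Busy, H_LongBusyOrder1mSec and
H_LongBusyOrder10mSec) generally mean that the hcall should be reissued, with
the long-busy variants hinting at how long to wait first. A minimal sketch of
that classification follows; the helper names are hypothetical, and the
numeric values are believed to match arch/powerpc/include/asm/hvcall.h but
should be verified there.

```c
#include <assert.h>
#include <stdbool.h>

/* Values believed to match hvcall.h; verify against the header. */
#define H_BUSY                    1
#define H_LONG_BUSY_ORDER_1_MSEC  9900
#define H_LONG_BUSY_ORDER_10_MSEC 9901

/* True if rc asks the guest to reissue the hcall. */
static bool hcall_is_busy(long rc)
{
    return rc == H_BUSY || rc == H_LONG_BUSY_ORDER_1_MSEC ||
           rc == H_LONG_BUSY_ORDER_10_MSEC;
}

/* Suggested wait in milliseconds before reissuing a busy hcall. */
static unsigned int hcall_retry_delay_ms(long rc)
{
    assert(hcall_is_busy(rc));
    switch (rc) {
    case H_LONG_BUSY_ORDER_1_MSEC:
        return 1;
    case H_LONG_BUSY_ORDER_10_MSEC:
        return 10;
    default:
        return 0; /* plain H_BUSY: retry right away */
    }
}
```

Any return value for which ``hcall_is_busy()`` is false is a final result,
either success or an error, and should not be retried blindly.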

**H_SCM_QUERY_BLOCK_MEM_BINDING**

| Input: *drcIndex, scmBlockIndex*
| Out: *Guest-Physical-Address*
| Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*

Given a DRC-Index and an SCM block index, return the guest physical address to
which the SCM block is mapped.

**H_SCM_QUERY_LOGICAL_MEM_BINDING**

| Input: *Guest-Physical-Address*
| Out: *drcIndex, scmBlockIndex*
| Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*

Given a guest physical address, return the DRC Index and SCM block mapped
to that address.

**H_SCM_UNBIND_ALL**

| Input: *scmTargetScope, drcIndex*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Busy,*
| *H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*

Depending on the target scope, unmap from the LPAR memory either all SCM
blocks belonging to all NVDIMMs or all SCM blocks belonging to a single NVDIMM
identified by its drcIndex.

**H_SCM_HEALTH**

| Input: *drcIndex*
| Out: *health-bitmap (r4), health-bit-valid-bitmap (r5)*
| Return Value: *H_Success, H_Parameter, H_Hardware*

Given a DRC Index, return information on predictive failure and overall health
of the PMEM device. The asserted bits in the health-bitmap indicate one or more
states (described in the table below) of the PMEM device, and
health-bit-valid-bitmap indicates which bits in health-bitmap are valid. The
bits are reported in reverse bit ordering (bit 0 is the most significant bit);
for example, a value of 0xC400000000000000 indicates that bits 0, 1, and 5 are
valid.

Health Bitmap Flags:

+------+-----------------------------------------------------------------------+
|  Bit |               Definition                                              |
+======+=======================================================================+
|  00  | PMEM device is unable to persist memory contents.                     |
|      | If the system is powered down, nothing will be saved.                 |
+------+-----------------------------------------------------------------------+
|  01  | PMEM device failed to persist memory contents. Either contents were   |
|      | not saved successfully on power down or were not restored properly on |
|      | power up.                                                             |
+------+-----------------------------------------------------------------------+
|  02  | PMEM device contents are persisted from previous IPL. The data from   |
|      | the last boot were successfully restored.                             |
+------+-----------------------------------------------------------------------+
|  03  | PMEM device contents are not persisted from previous IPL. There was no|
|      | data to restore from the last boot.                                   |
+------+-----------------------------------------------------------------------+
|  04  | PMEM device memory life remaining is critically low                   |
+------+-----------------------------------------------------------------------+
|  05  | PMEM device will be garded off next IPL due to failure                |
+------+-----------------------------------------------------------------------+
|  06  | PMEM device contents cannot persist due to current platform health    |
|      | status. A hardware failure may prevent data from being saved or       |
|      | restored.                                                             |
+------+-----------------------------------------------------------------------+
|  07  | PMEM device is unable to persist memory contents in certain conditions|
+------+-----------------------------------------------------------------------+
|  08  | PMEM device is encrypted                                              |
+------+-----------------------------------------------------------------------+
|  09  | PMEM device has successfully completed a requested erase or secure    |
|      | erase procedure.                                                      |
+------+-----------------------------------------------------------------------+
|10:63 | Reserved / Unused                                                     |
+------+-----------------------------------------------------------------------+
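
The bit numbering used by the health bitmaps can be modelled with a small
sketch. The helper names ``papr_health_bit`` and
``papr_health_flag_asserted`` are hypothetical; in kernel code the existing
``PPC_BIT()`` macro expresses the same MSB-first numbering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask for bit n in MSB-0 numbering: bit 0 is the most significant bit. */
static uint64_t papr_health_bit(unsigned int n)
{
    assert(n < 64);
    return UINT64_C(1) << (63 - n);
}

/* A flag counts as asserted only if its valid bit is also set. */
static bool papr_health_flag_asserted(uint64_t health, uint64_t valid,
                                      unsigned int n)
{
    uint64_t mask = papr_health_bit(n);

    return (valid & mask) != 0 && (health & mask) != 0;
}
```

With this numbering, the masks for bits 0, 1 and 5 combine to
0xC400000000000000, matching the example given for health-bit-valid-bitmap.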

**H_SCM_PERFORMANCE_STATS**

| Input: *drcIndex, resultBuffer Addr*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_Unsupported, H_Hardware, H_Authority, H_Privilege*

Given a DRC Index, collect the performance statistics for the NVDIMM and copy
them to the resultBuffer.

**H_SCM_FLUSH**

| Input: *drcIndex, continue-token*
| Out: *continue-token*
| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*

Given a DRC Index, flush the data to the backend NVDIMM device.

The hcall returns H_BUSY when the flush takes a long time and the hcall needs
to be issued multiple times in order to be completely serviced. The
*continue-token* from the output must be passed in the argument list of
subsequent hcalls to the hypervisor until the hcall is completely serviced,
at which point H_SUCCESS or an error is returned by the hypervisor.

**H_HTM**

| Input: *flags, target, operation (op), op-param1, op-param2, op-param3*
| Out: *dumphtmbufferdata*
| Return Value: *H_Success, H_Busy, H_LongBusyOrder, H_Partial, H_Parameter,*
| *H_P2, H_P3, H_P4, H_P5, H_P6, H_State, H_Not_Available, H_Authority*

H_HTM supports setup, configuration, control and dumping of the Hardware Trace
Macro (HTM) function and its data. The HTM buffer stores tracing data for
functions like core instruction, core LLAT and nest.

References
==========
.. [1] "Power Architecture Platform Reference"
       https://en.wikipedia.org/wiki/Power_Architecture_Platform_Reference
.. [2] "Linux on Power Architecture Platform Reference"
       https://members.openpowerfoundation.org/document/dl/469
.. [3] "Definitions and Notation" Book III-Section 14.5.3
       https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
.. [4] arch/powerpc/include/asm/hvcall.h
.. [5] "64-Bit ELF V2 ABI Specification: Power Architecture"
       https://openpowerfoundation.org/?resource_lib=64-bit-elf-v2-abi-specification-power-architecture