Using Multiple ``IOThread``\ s
==============================

..
   Copyright (c) 2014-2017 Red Hat Inc.

   This work is licensed under the terms of the GNU GPL, version 2 or later. See
   the COPYING file in the top-level directory.


This document explains the ``IOThread`` feature and how to write code that runs
outside the BQL.

The main loop and ``IOThread``\ s
---------------------------------
QEMU is an event-driven program that can do several things at once using an
event loop. The VNC server and the QMP monitor are both processed from the
same event loop, which monitors their file descriptors until they become
readable and then invokes a callback.

The default event loop is called the main loop (see ``main-loop.c``). It is
possible to create additional event loop threads using
``-object iothread,id=my-iothread``.

Side note: The main loop and ``IOThread`` are both event loops but their code is
not shared completely. Sometimes it is useful to remember that although they
are conceptually similar they are currently not interchangeable.

Why ``IOThread``\ s are useful
------------------------------
``IOThread``\ s allow the user to control the placement of work. The main loop is a
scalability bottleneck on hosts with many CPUs. Work can be spread across
several ``IOThread``\ s instead of just one main loop. When set up correctly this
can improve I/O latency and reduce jitter seen by the guest.

The main loop is also deeply associated with the BQL, which is a
scalability bottleneck in itself. vCPU threads and the main loop use the BQL
to serialize execution of QEMU code. This mutex is necessary because a lot of
QEMU's code historically was not thread-safe.

The fact that all I/O processing is done in a single main loop and that the
BQL is contended by all vCPU threads and the main loop explains
why it is desirable to place work into ``IOThread``\ s.

The experimental ``virtio-blk`` data-plane implementation has been benchmarked and
shows these effects:
ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf

.. _how-to-program:

How to program for ``IOThread``\ s
----------------------------------
The main difference between legacy code and new code that can run in an
``IOThread`` is dealing explicitly with the event loop object, ``AioContext``
(see ``include/block/aio.h``). Code that only works in the main loop
implicitly uses the main loop's ``AioContext``. Code that supports running
in ``IOThread``\ s must be aware of its ``AioContext``.

AioContext supports the following services:

* File descriptor monitoring (read/write/error on POSIX hosts)
* Event notifiers (inter-thread signalling)
* Timers
* Bottom Halves (BH) deferred callbacks

There are several old APIs that use the main loop AioContext:

* LEGACY ``qemu_aio_set_fd_handler()`` - monitor a file descriptor
* LEGACY ``qemu_aio_set_event_notifier()`` - monitor an event notifier
* LEGACY ``timer_new_ms()`` - create a timer
* LEGACY ``qemu_bh_new()`` - create a BH
* LEGACY ``qemu_bh_new_guarded()`` - create a BH with a device re-entrancy guard
* LEGACY ``qemu_aio_wait()`` - run an event loop iteration

Since they implicitly work on the main loop they cannot be used in code that
runs in an ``IOThread``. They might cause a crash or deadlock if called from an
``IOThread`` since the BQL is not held.

Instead, use the ``AioContext`` functions directly (see ``include/block/aio.h``):

* ``aio_set_fd_handler()`` - monitor a file descriptor
* ``aio_set_event_notifier()`` - monitor an event notifier
* ``aio_timer_new()`` - create a timer
* ``aio_bh_new()`` - create a BH
* ``aio_bh_new_guarded()`` - create a BH with a device re-entrancy guard
* ``aio_poll()`` - run an event loop iteration

The ``qemu_bh_new_guarded``/``aio_bh_new_guarded`` APIs accept a
``MemReentrancyGuard``
argument, which is used to check for and prevent re-entrancy problems. For
BHs associated with devices, the reentrancy-guard is contained in the
corresponding ``DeviceState`` and named ``mem_reentrancy_guard``.

The ``AioContext`` can be obtained from the ``IOThread`` using
``iothread_get_aio_context()`` or for the main loop using
``qemu_get_aio_context()``. Code that takes an ``AioContext`` argument
works in both ``IOThread``\ s and the main loop, depending on which ``AioContext``
instance the caller passes in.

How to synchronize with an ``IOThread``
---------------------------------------
Variables that can be accessed by multiple threads require some form of
synchronization such as ``qemu_mutex_lock()``, ``rcu_read_lock()``, etc.

``AioContext`` functions like ``aio_set_fd_handler()``,
``aio_set_event_notifier()``, ``aio_bh_new()``, and ``aio_timer_new()``
are thread-safe. They can be used to trigger activity in an ``IOThread``.

Side note: the best way to schedule a function call across threads is to call
``aio_bh_schedule_oneshot()``.

The main loop thread can wait synchronously for a condition using
``AIO_WAIT_WHILE()``.

``AioContext`` and the block layer
----------------------------------
The ``AioContext`` originates from the QEMU block layer, even though nowadays
``AioContext`` is a generic event loop that can be used by any QEMU subsystem.

The block layer has support for ``AioContext`` integrated. Each
``BlockDriverState`` is associated with an ``AioContext`` using
``bdrv_try_change_aio_context()`` and ``bdrv_get_aio_context()``.
This allows block layer code to process I/O inside the
right ``AioContext``. Other subsystems may wish to follow a similar approach.

Block layer code must therefore expect to run in an ``IOThread`` and avoid using
old APIs that implicitly use the main loop. See
`How to program for IOThreads`_ for information on how to do that.

Code running in the monitor typically needs to ensure that past
requests from the guest are completed. When a block device is running
in an ``IOThread``, the ``IOThread`` can also process requests from the guest
(via ioeventfd). To achieve both objectives, wrap the code between
``bdrv_drained_begin()`` and ``bdrv_drained_end()``, thus creating a "drained
section".

Long-running jobs (usually in the form of coroutines) are often scheduled in
the ``BlockDriverState``'s ``AioContext``. The functions
``bdrv_add``/``remove_aio_context_notifier``, or alternatively
``blk_add``/``remove_aio_context_notifier`` if you use ``BlockBackends``,
can be used to get a notification whenever ``bdrv_try_change_aio_context()``
moves a ``BlockDriverState`` to a different ``AioContext``.
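
Scheduling work into a particular event loop, which is what
``aio_bh_schedule_oneshot()`` does for an ``AioContext``, amounts to handing a
callback to the thread that runs that loop. The following is a toy model of
that idea using plain POSIX threads. It is not QEMU code and does not reflect
how ``AioContext`` is implemented: every ``Toy...`` type and ``toy_...``
function is a hypothetical name made up for this sketch, and real code should
use the ``AioContext`` functions listed earlier.

.. code-block:: c

    #include <assert.h>
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdlib.h>

    /* Toy stand-ins, not QEMU types. All names here are hypothetical. */
    typedef void ToyBHFunc(void *opaque);

    typedef struct ToyBH {
        ToyBHFunc *cb;
        void *opaque;
        struct ToyBH *next;
    } ToyBH;

    typedef struct ToyLoop {
        pthread_t thread;
        pthread_mutex_t lock;   /* protects head, tail and stopping */
        pthread_cond_t cond;    /* signalled when work arrives or on stop */
        ToyBH *head, *tail;     /* FIFO of pending one-shot callbacks */
        bool stopping;
    } ToyLoop;

    /* Event loop thread: run queued callbacks until stopped and drained. */
    static void *toy_loop_run(void *arg)
    {
        ToyLoop *loop = arg;

        pthread_mutex_lock(&loop->lock);
        for (;;) {
            ToyBH *bh;

            while (!loop->head && !loop->stopping) {
                pthread_cond_wait(&loop->cond, &loop->lock);
            }
            bh = loop->head;
            if (!bh) {
                break; /* stopping was set and the queue is empty */
            }
            loop->head = bh->next;
            if (!loop->head) {
                loop->tail = NULL;
            }
            /* Drop the lock so that callbacks may schedule further work. */
            pthread_mutex_unlock(&loop->lock);
            bh->cb(bh->opaque);
            free(bh);
            pthread_mutex_lock(&loop->lock);
        }
        pthread_mutex_unlock(&loop->lock);
        return NULL;
    }

    /* Start the loop thread. Returns 0 on success, like pthread_create(). */
    static int toy_loop_start(ToyLoop *loop)
    {
        pthread_mutex_init(&loop->lock, NULL);
        pthread_cond_init(&loop->cond, NULL);
        loop->head = loop->tail = NULL;
        loop->stopping = false;
        return pthread_create(&loop->thread, NULL, toy_loop_run, loop);
    }

    /* Thread-safe: may be called from any thread. Returns 0 on success. */
    static int toy_schedule_oneshot(ToyLoop *loop, ToyBHFunc *cb, void *opaque)
    {
        ToyBH *bh = malloc(sizeof(*bh));

        assert(loop && cb);
        if (!bh) {
            return -1;
        }
        bh->cb = cb;
        bh->opaque = opaque;
        bh->next = NULL;

        pthread_mutex_lock(&loop->lock);
        if (loop->tail) {
            loop->tail->next = bh;
        } else {
            loop->head = bh;
        }
        loop->tail = bh;
        pthread_cond_signal(&loop->cond);
        pthread_mutex_unlock(&loop->lock);
        return 0;
    }

    /* Ask the loop to exit once its queue is empty, then wait for it. */
    static void toy_loop_stop(ToyLoop *loop)
    {
        pthread_mutex_lock(&loop->lock);
        loop->stopping = true;
        pthread_cond_signal(&loop->cond);
        pthread_mutex_unlock(&loop->lock);
        pthread_join(loop->thread, NULL);
        pthread_mutex_destroy(&loop->lock);
        pthread_cond_destroy(&loop->cond);
    }

    /* Example callback: runs in the loop thread and bumps an int counter. */
    static void toy_count_cb(void *opaque)
    {
        int *counter = opaque;

        (*counter)++;
    }

A caller would invoke ``toy_loop_start()``, pass ``toy_count_cb`` and a pointer
to an ``int`` to ``toy_schedule_oneshot()`` from any thread, and finally call
``toy_loop_stop()``. Reading the counter after ``toy_loop_stop()`` returns is
safe because ``pthread_join()`` orders the callback's writes before the read.
The callback-plus-opaque-pointer shape mirrors the arguments that
``aio_bh_schedule_oneshot()`` takes after its ``AioContext``.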