.. SPDX-License-Identifier: GPL-2.0+

======
XArray
======

:Author: Matthew Wilcox

Overview
========

The XArray is an abstract data type which behaves like a very large array
of pointers.  It meets many of the same needs as a hash or a conventional
resizable array.  Unlike a hash, it allows you to sensibly go to the
next or previous entry in a cache-efficient manner.  In contrast to a
resizable array, there is no need to copy data or change MMU mappings in
order to grow the array.  It is more memory-efficient, parallelisable
and cache friendly than a doubly-linked list.  It takes advantage of
RCU to perform lookups without locking.

The XArray implementation is efficient when the indices used are densely
clustered; hashing the object and using the hash as the index will not
perform well.  The XArray is optimised for small indices, but still has
good performance with large indices.  If your index can be larger than
``ULONG_MAX`` then the XArray is not the data type for you.  The most
important user of the XArray is the page cache.

Normal pointers may be stored in the XArray directly.  They must be 4-byte
aligned, which is true for any pointer returned from kmalloc() and
alloc_page().  It isn't true for arbitrary user-space pointers,
nor for function pointers.  You can store pointers to statically allocated
objects, as long as those objects have an alignment of at least 4.

You can also store integers between 0 and ``LONG_MAX`` in the XArray.
You must first convert the integer into an entry using xa_mk_value().
When you retrieve an entry from the XArray, you can check whether it is
a value entry by calling xa_is_value(), and convert it back to
an integer by calling xa_to_value().
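As a rough illustration of the value-entry helpers, the sketch below keeps
small counters in an XArray.  The ``counts`` array and the ``counts_set()``
and ``counts_get()`` helpers are hypothetical names made up for this example,
and the code is meant to be built as part of the kernel rather than run
standalone::

    #include <linux/xarray.h>

    static DEFINE_XARRAY(counts);    /* hypothetical array of counters */

    /* Store the integer @n (at most LONG_MAX) at @index. */
    static int counts_set(unsigned long index, unsigned long n)
    {
        return xa_err(xa_store(&counts, index, xa_mk_value(n), GFP_KERNEL));
    }

    /* Return true and fill in *@n if a value entry is stored at @index. */
    static bool counts_get(unsigned long index, unsigned long *n)
    {
        void *entry = xa_load(&counts, index);

        if (!xa_is_value(entry))
            return false;    /* NULL or an ordinary pointer */
        *n = xa_to_value(entry);
        return true;
    }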

Some users want to tag the pointers they store in the XArray.  You can
call xa_tag_pointer() to create an entry with a tag, xa_untag_pointer()
to turn a tagged entry back into an untagged pointer and xa_pointer_tag()
to retrieve the tag of an entry.  Tagged pointers use the same bits that
are used to distinguish value entries from normal pointers, so you must
decide whether you want to store value entries or tagged pointers in any
particular XArray.
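The clash between tagged pointers and value entries can be seen in a small
userspace model of the entry encoding.  The ``model_*`` functions below are
not kernel API; they are a sketch which assumes the bit layout used by
include/linux/xarray.h at the time of writing (value entries set bit 0,
tags occupy the low two bits), so check the header before relying on it::

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Model of xa_mk_value(): shift the integer up and set bit 0. */
    static void *model_mk_value(unsigned long v)
    {
        return (void *)(((uintptr_t)v << 1) | 1);
    }

    /* Model of xa_to_value(): undo the shift. */
    static unsigned long model_to_value(const void *entry)
    {
        return (unsigned long)((uintptr_t)entry >> 1);
    }

    /* Model of xa_is_value(): value entries have bit 0 set. */
    static bool model_is_value(const void *entry)
    {
        return (uintptr_t)entry & 1;
    }

    /* Model of xa_tag_pointer(): OR the tag into the low bits. */
    static void *model_tag_pointer(void *p, unsigned long tag)
    {
        return (void *)((uintptr_t)p | tag);
    }

    /* Model of xa_untag_pointer(): clear the low two bits. */
    static void *model_untag_pointer(void *entry)
    {
        return (void *)((uintptr_t)entry & ~(uintptr_t)3);
    }

    /* Model of xa_pointer_tag(): read the low two bits. */
    static unsigned int model_pointer_tag(void *entry)
    {
        return (uintptr_t)entry & 3;
    }

    static long object;    /* any object with at least 4-byte alignment */

    static void model_demo(void)
    {
        void *tagged = model_tag_pointer(&object, 1);

        assert(model_to_value(model_mk_value(42)) == 42);
        assert(model_untag_pointer(tagged) == (void *)&object);
        assert(model_pointer_tag(tagged) == 1);
        /* A pointer tagged with 1 looks exactly like a value entry. */
        assert(model_is_value(tagged));
    }

The last assertion is the reason a given XArray should hold either value
entries or tagged pointers, but not both.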

The XArray does not support storing IS_ERR() pointers as some
conflict with value entries or internal entries.

An unusual feature of the XArray is the ability to create entries which
occupy a range of indices.  Once stored to, looking up any index in
the range will return the same entry as looking up any other index in
the range.  Storing to any index will store to all of them.  Multi-index
entries can be explicitly split into smaller entries.  Unsetting (using
xa_erase() or xa_store() with ``NULL``) any entry will cause the XArray
to forget about the range.

Normal API
==========

Start by initialising an XArray, either with DEFINE_XARRAY()
for statically allocated XArrays or xa_init() for dynamically
allocated ones.  A freshly-initialised XArray contains a ``NULL``
pointer at every index.

You can then set entries using xa_store() and get entries using
xa_load().  xa_store() will overwrite any entry with the new entry and
return the previous entry stored at that index.  You can unset entries
using xa_erase() or by setting the entry to ``NULL`` using xa_store().
There is no difference between an entry that has never been stored to
and one that has been erased with xa_erase(); an entry that has most
recently had ``NULL`` stored to it is also equivalent except if the
XArray was initialized with ``XA_FLAGS_ALLOC``.
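A minimal sketch of the store and erase calls follows.  ``struct item``,
``items``, ``item_add()`` and ``item_del()`` are hypothetical names, and the
sketch assumes that callers are serialised by some outer lock so that no
lookup can race with kfree(); it is kernel-only code::

    #include <linux/slab.h>
    #include <linux/xarray.h>

    struct item {            /* hypothetical payload */
        int data;
    };

    static DEFINE_XARRAY(items);

    /* Allocate an item and store it at @index, replacing any old item. */
    static int item_add(unsigned long index, int data)
    {
        struct item *item = kmalloc(sizeof(*item), GFP_KERNEL);
        void *old;

        if (!item)
            return -ENOMEM;
        item->data = data;

        old = xa_store(&items, index, item, GFP_KERNEL);
        if (xa_is_err(old)) {
            kfree(item);
            return xa_err(old);
        }
        kfree(old);    /* the previous entry, or NULL if there was none */
        return 0;
    }

    /* Remove and free whatever is stored at @index. */
    static void item_del(unsigned long index)
    {
        kfree(xa_erase(&items, index));    /* xa_erase() returns the old entry */
    }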

You can conditionally replace an entry at an index by using
xa_cmpxchg().  Like cmpxchg(), it will only succeed if
the entry at that index has the 'old' value.  It also returns the entry
which was at that index; if it returns the same entry which was passed as
'old', then xa_cmpxchg() succeeded.

If you want to only store a new entry to an index if the current entry
at that index is ``NULL``, you can use xa_insert() which
returns ``-EBUSY`` if the entry is not empty.
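The two conditional stores might be wrapped as below.  The helper names
``item_replace()`` and ``item_add_unique()`` are invented for this sketch,
which is kernel-only code::

    #include <linux/xarray.h>

    /* Replace @old with @new at @index, but only if @old is still there. */
    static bool item_replace(struct xarray *xa, unsigned long index,
                             void *old, void *new)
    {
        void *curr = xa_cmpxchg(xa, index, old, new, GFP_KERNEL);

        if (xa_is_err(curr))
            return false;    /* for example, memory allocation failed */
        return curr == old;  /* equal means the exchange happened */
    }

    /* Store @item at @index only if nothing is stored there yet. */
    static int item_add_unique(struct xarray *xa, unsigned long index,
                               void *item)
    {
        /* 0 on success, -EBUSY if occupied, -ENOMEM on allocation failure */
        return xa_insert(xa, index, item, GFP_KERNEL);
    }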

You can copy entries out of the XArray into a plain array by calling
xa_extract().  Or you can iterate over the present entries in the XArray
by calling xa_for_each(), xa_for_each_start() or xa_for_each_range().
You may prefer to use xa_find() or xa_find_after() to move to the next
present entry in the XArray.
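The following kernel-only sketch shows one way of using each of the three
approaches.  The ``items_*`` helpers and the batch size of 16 are
assumptions made for the example::

    #include <linux/kernel.h>
    #include <linux/printk.h>
    #include <linux/xarray.h>

    /* Log every present entry; NULL entries are skipped by the iterator. */
    static void items_dump(struct xarray *xa)
    {
        unsigned long index;
        void *entry;

        xa_for_each(xa, index, entry)
            pr_info("index %lu: %p\n", index, entry);
    }

    /* Copy up to 16 present entries at or after @start into @batch. */
    static unsigned int items_peek(struct xarray *xa, unsigned long start,
                                   void *batch[16])
    {
        return xa_extract(xa, batch, start, ULONG_MAX, 16, XA_PRESENT);
    }

    /* Return the first present entry at or after *@index and update *@index. */
    static void *items_next(struct xarray *xa, unsigned long *index)
    {
        return xa_find(xa, index, ULONG_MAX, XA_PRESENT);
    }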

Calling xa_store_range() stores the same entry in a range
of indices.  If you do this, some of the other operations will behave
in a slightly odd way.  For example, marking the entry at one index
may result in the entry being marked at some, but not all of the other
indices.  Storing into one index may result in the entry retrieved by
some, but not all of the other indices changing.

Sometimes you need to ensure that a subsequent call to xa_store()
will not need to allocate memory.  The xa_reserve() function
will store a reserved entry at the indicated index.  Users of the
normal API will see this entry as containing ``NULL``.  If you do
not need to use the reserved entry, you can call xa_release()
to remove the unused entry.  If another user has stored to the entry
in the meantime, xa_release() will do nothing; if instead you
want the entry to become ``NULL``, you should use xa_erase().
Using xa_insert() on a reserved entry will fail.
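A reserve-then-store sequence could look like the sketch below.  The
``slot_fill()`` helper and its ``commit`` parameter are hypothetical and
only stand in for whatever decides whether the reservation is used; this is
kernel-only code::

    #include <linux/xarray.h>

    static int slot_fill(struct xarray *xa, unsigned long index, void *item,
                         bool commit)
    {
        int err;

        /* May sleep and allocate; afterwards the slot exists. */
        err = xa_reserve(xa, index, GFP_KERNEL);
        if (err)
            return err;

        if (!commit) {
            /* Drop the reservation unless somebody else stored to it. */
            xa_release(xa, index);
            return 0;
        }

        /* The slot is already present, so no allocation is expected here. */
        return xa_err(xa_store(xa, index, item, GFP_NOWAIT));
    }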

If all entries in the array are ``NULL``, the xa_empty() function
will return ``true``.

Finally, you can remove all entries from an XArray by calling
xa_destroy().  If the XArray entries are pointers, you may wish
to free the entries first.  You can do this by iterating over all present
entries in the XArray using the xa_for_each() iterator.
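A teardown helper along these lines is common.  ``items_free_all()`` is a
made-up name, and the sketch assumes every entry came from kmalloc() and
that nothing else is using the array any more::

    #include <linux/slab.h>
    #include <linux/xarray.h>

    static void items_free_all(struct xarray *xa)
    {
        unsigned long index;
        void *entry;

        xa_for_each(xa, index, entry)
            kfree(entry);    /* free each object before dropping the array */
        xa_destroy(xa);      /* then release the XArray's own memory */
    }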

Search Marks
------------

Each entry in the array has three bits associated with it called marks.
Each mark may be set or cleared independently of the others.  You can
iterate over marked entries by using the xa_for_each_marked() iterator.

You can enquire whether a mark is set on an entry by using
xa_get_mark().  If the entry is not ``NULL``, you can set a mark on it
by using xa_set_mark() and remove the mark from an entry by calling
xa_clear_mark().  You can ask whether any entry in the XArray has a
particular mark set by calling xa_marked().  Erasing an entry from the
XArray causes all marks associated with that entry to be cleared.
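As a sketch of how marks might be used, the code below treats one mark as
a "dirty" flag.  ``ITEM_DIRTY``, ``item_mark_dirty()`` and ``items_flush()``
are names invented for the example, and the write-back step is left as a
comment; this is kernel-only code::

    #include <linux/xarray.h>

    #define ITEM_DIRTY    XA_MARK_1    /* hypothetical use of mark 1 */

    static void item_mark_dirty(struct xarray *xa, unsigned long index)
    {
        xa_set_mark(xa, index, ITEM_DIRTY);    /* no effect on a NULL entry */
    }

    /* Visit every dirty entry, clear its mark and return how many we saw. */
    static unsigned int items_flush(struct xarray *xa)
    {
        unsigned long index;
        void *entry;
        unsigned int flushed = 0;

        if (!xa_marked(xa, ITEM_DIRTY))
            return 0;    /* no entry carries the mark */

        xa_for_each_marked(xa, index, entry, ITEM_DIRTY) {
            /* ... write @entry back here ... */
            xa_clear_mark(xa, index, ITEM_DIRTY);
            flushed++;
        }
        return flushed;
    }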

Setting or clearing a mark on any index of a multi-index entry will
affect all indices covered by that entry.  Querying the mark on any
index will return the same result.

There is no way to iterate over entries which are not marked; the data
structure does not allow this to be implemented efficiently.  There are
not currently iterators to search for logical combinations of bits (e.g.
iterate over all entries which have both ``XA_MARK_1`` and ``XA_MARK_2``
set, or iterate over all entries which have ``XA_MARK_0`` or ``XA_MARK_2``
set).  It would be possible to add these if a user arises.

Allocating XArrays
------------------

If you use DEFINE_XARRAY_ALLOC() to define the XArray, or
initialise it by passing ``XA_FLAGS_ALLOC`` to xa_init_flags(),
the XArray changes to track whether entries are in use or not.

You can call xa_alloc() to store the entry at an unused index
in the XArray.  If you need to modify the array from interrupt context,
you can use xa_alloc_bh() or xa_alloc_irq() to disable
softirqs or interrupts, respectively, while allocating the ID.

Using xa_store(), xa_cmpxchg() or xa_insert() will
also mark the entry as being allocated.  Unlike a normal XArray, storing
``NULL`` will mark the entry as being in use, like xa_reserve().
To free an entry, use xa_erase() (or xa_release() if
you only want to free the entry if it's ``NULL``).

By default, the lowest free entry is allocated starting from 0.  If you
want to allocate entries starting at 1, it is more efficient to use
DEFINE_XARRAY_ALLOC1() or ``XA_FLAGS_ALLOC1``.  If you want to
allocate IDs up to a maximum, then wrap back around to the lowest free
ID, you can use xa_alloc_cyclic().
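An ID allocator built on an allocating XArray might look like this sketch.
The ``ids`` array, the ``id_*`` helpers and the 1 to 1023 limit are
assumptions chosen for illustration; this is kernel-only code::

    #include <linux/xarray.h>

    static DEFINE_XARRAY_ALLOC(ids);
    static u32 next_id;    /* cursor used by the cyclic allocator */

    /* Store @object at the lowest free index and report it in *@id. */
    static int id_alloc(void *object, u32 *id)
    {
        return xa_alloc(&ids, id, object, xa_limit_32b, GFP_KERNEL);
    }

    /* Hand out IDs in increasing order, wrapping back to 1 after 1023. */
    static int id_alloc_cyclic(void *object, u32 *id)
    {
        int ret = xa_alloc_cyclic(&ids, id, object, XA_LIMIT(1, 1023),
                                  &next_id, GFP_KERNEL);

        /* A return of 1 means success after wrapping around. */
        return ret < 0 ? ret : 0;
    }

    /* Release an ID so that it can be allocated again. */
    static void id_free(u32 id)
    {
        xa_erase(&ids, id);
    }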

You cannot use ``XA_MARK_0`` with an allocating XArray as this mark
is used to track whether an entry is free or not.  The other marks are
available for your use.

Memory allocation
-----------------

The xa_store(), xa_cmpxchg(), xa_alloc(),
xa_reserve() and xa_insert() functions take a gfp_t
parameter in case the XArray needs to allocate memory to store this entry.
If the entry is being deleted, no memory allocation needs to be performed,
and the GFP flags specified will be ignored.

It is possible for no memory to be allocatable, particularly if you pass
a restrictive set of GFP flags.  In that case, the functions return a
special value which can be turned into an errno using xa_err().
If you don't need to know exactly which error occurred, using
xa_is_err() is slightly more efficient.

Locking
-------

When using the Normal API, you do not have to worry about locking.
The XArray uses RCU and an internal spinlock to synchronise access:

No lock needed:
 * xa_empty()
 * xa_marked()

Takes RCU read lock:
 * xa_load()
 * xa_for_each()
 * xa_for_each_start()
 * xa_for_each_range()
 * xa_find()
 * xa_find_after()
 * xa_extract()
 * xa_get_mark()

Takes xa_lock internally:
 * xa_store()
 * xa_store_bh()
 * xa_store_irq()
 * xa_insert()
 * xa_insert_bh()
 * xa_insert_irq()
 * xa_erase()
 * xa_erase_bh()
 * xa_erase_irq()
 * xa_cmpxchg()
 * xa_cmpxchg_bh()
 * xa_cmpxchg_irq()
 * xa_store_range()
 * xa_alloc()
 * xa_alloc_bh()
 * xa_alloc_irq()
 * xa_reserve()
 * xa_reserve_bh()
 * xa_reserve_irq()
 * xa_destroy()
 * xa_set_mark()
 * xa_clear_mark()

Assumes xa_lock held on entry:
 * __xa_store()
 * __xa_insert()
 * __xa_erase()
 * __xa_cmpxchg()
 * __xa_alloc()
 * __xa_set_mark()
 * __xa_clear_mark()

If you want to take advantage of the lock to protect the data structures
that you are storing in the XArray, you can call xa_lock()
before calling xa_load(), then take a reference count on the
object you have found before calling xa_unlock().  This will
prevent stores from removing the object from the array between looking
up the object and incrementing the refcount.  You can also use RCU to
avoid dereferencing freed memory, but an explanation of that is beyond
the scope of this document.
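The lookup-and-reference pattern described above can be sketched as
follows.  ``struct item`` and ``item_get()`` are hypothetical, and the
sketch assumes that whoever erases an entry drops the array's reference
only after the erase has completed; this is kernel-only code::

    #include <linux/kref.h>
    #include <linux/xarray.h>

    struct item {            /* hypothetical refcounted object */
        struct kref ref;
    };

    /* Look up @index and return the object with an extra reference held. */
    static struct item *item_get(struct xarray *xa, unsigned long index)
    {
        struct item *item;

        xa_lock(xa);
        item = xa_load(xa, index);
        if (item)
            kref_get(&item->ref);    /* safe: stores are excluded by xa_lock */
        xa_unlock(xa);

        return item;
    }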

The XArray does not disable interrupts or softirqs while modifying
the array.  It is safe to read the XArray from interrupt or softirq
context as the RCU lock provides enough protection.

If, for example, you want to store entries in the XArray in process
context and then erase them in softirq context, you can do that this way::

    void foo_init(struct foo *foo)
    {
        xa_init_flags(&foo->array, XA_FLAGS_LOCK_BH);
    }

    int foo_store(struct foo *foo, unsigned long index, void *entry)
    {
        int err;

        xa_lock_bh(&foo->array);
        err = xa_err(__xa_store(&foo->array, index, entry, GFP_KERNEL));
        if (!err)
            foo->count++;
        xa_unlock_bh(&foo->array);
        return err;
    }

    /* foo_erase() is only called from softirq context */
    void foo_erase(struct foo *foo, unsigned long index)
    {
        xa_lock(&foo->array);
        __xa_erase(&foo->array, index);
        foo->count--;
        xa_unlock(&foo->array);
    }

If you are going to modify the XArray from interrupt or softirq context,
you need to initialise the array using xa_init_flags(), passing
``XA_FLAGS_LOCK_IRQ`` or ``XA_FLAGS_LOCK_BH``.

The above example also shows a common pattern of wanting to extend the
coverage of the xa_lock on the store side to protect some statistics
associated with the array.

Sharing the XArray with interrupt context is also possible, either
using xa_lock_irqsave() in both the interrupt handler and process
context, or xa_lock_irq() in process context and xa_lock()
in the interrupt handler.  Some of the more common patterns have helper
functions such as xa_store_bh(), xa_store_irq(),
xa_erase_bh(), xa_erase_irq(), xa_cmpxchg_bh()
and xa_cmpxchg_irq().

Sometimes you need to protect access to the XArray with a mutex because
that lock sits above another mutex in the locking hierarchy.  That does
not entitle you to use functions like __xa_erase() without taking
the xa_lock; the xa_lock is used for lockdep validation and will be used
for other purposes in the future.

The __xa_set_mark() and __xa_clear_mark() functions are also
available for situations where you look up an entry and want to atomically
set or clear a mark.  It may be more efficient to use the advanced API
in this case, as it will save you from walking the tree twice.

Advanced API
============

The advanced API offers more flexibility and better performance at the
cost of an interface which can be harder to use and has fewer safeguards.
No locking is done for you by the advanced API, and you are required
to use the xa_lock while modifying the array.  You can choose whether
to use the xa_lock or the RCU lock while doing read-only operations on
the array.  You can mix advanced and normal operations on the same array;
indeed the normal API is implemented in terms of the advanced API.  The
advanced API is only available to modules with a GPL-compatible license.

The advanced API is based around the xa_state.  This is an opaque data
structure which you declare on the stack using the XA_STATE() macro.
This macro initialises the xa_state ready to start walking around the
XArray.  It is used as a cursor to maintain the position in the XArray
and let you compose various operations together without having to restart
from the top every time.  The contents of the xa_state are protected by
the rcu_read_lock() or the xas_lock().  If you need to drop whichever of
those locks is protecting your state and tree, you must call xas_pause()
so that future calls do not rely on the parts of the state which were
left unprotected.

The xa_state is also used to store errors.  You can call
xas_error() to retrieve the error.  All operations check whether
the xa_state is in an error state before proceeding, so there's no need
for you to check for an error after each call; you can make multiple
calls in succession and only check at a convenient point.  The only
errors currently generated by the XArray code itself are ``ENOMEM`` and
``EINVAL``, but it supports arbitrary errors in case you want to call
xas_set_err() yourself.

If the xa_state is holding an ``ENOMEM`` error, calling xas_nomem()
will attempt to allocate more memory using the specified gfp flags and
cache it in the xa_state for the next attempt.  The idea is that you take
the xa_lock, attempt the operation and drop the lock.  The operation
attempts to allocate memory while holding the lock, but it is more
likely to fail.  Once you have dropped the lock, xas_nomem()
can try harder to allocate more memory.  It will return ``true`` if it
is worth retrying the operation (i.e. that there was a memory error *and*
more memory was allocated).  If it has previously allocated memory, and
that memory wasn't used, and there is no error (or some error that isn't
``ENOMEM``), then it will free the memory previously allocated.
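The take-lock, try, drop-lock, allocate cycle described above is usually
written as a loop.  In the sketch below ``item_store()`` is a hypothetical
wrapper; the calls inside it are the advanced API, so this is kernel-only
code for a GPL-compatible module::

    #include <linux/xarray.h>

    /* Store @item at @index, retrying while more memory can be found. */
    static int item_store(struct xarray *xa, unsigned long index, void *item)
    {
        XA_STATE(xas, xa, index);

        do {
            xas_lock(&xas);
            xas_store(&xas, item);    /* may record -ENOMEM in the xa_state */
            xas_unlock(&xas);
        } while (xas_nomem(&xas, GFP_KERNEL));    /* true: worth retrying */

        return xas_error(&xas);    /* 0 or a negative errno */
    }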

Internal Entries
----------------

The XArray reserves some entries for its own purposes.  These are never
exposed through the normal API, but when using the advanced API, it's
possible to see them.  Usually the best way to handle them is to pass them
to xas_retry(), and retry the operation if it returns ``true``.

.. flat-table::
   :widths: 1 1 6

   * - Name
     - Test
     - Usage

   * - Node
     - xa_is_node()
     - An XArray node.  May be visible when using a multi-index xa_state.

   * - Sibling
     - xa_is_sibling()
     - A non-canonical entry for a multi-index entry.  The value indicates
       which slot in this node has the canonical entry.

   * - Retry
     - xa_is_retry()
     - This entry is currently being modified by a thread which has the
       xa_lock.  The node containing this entry may be freed at the end
       of this RCU period.  You should restart the lookup from the head
       of the array.

   * - Zero
     - xa_is_zero()
     - Zero entries appear as ``NULL`` through the Normal API, but occupy
       an entry in the XArray which can be used to reserve the index for
       future use.  This is used by allocating XArrays for allocated entries
       which are ``NULL``.

Other internal entries may be added in the future.  As far as possible, they
will be handled by xas_retry().
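A lockless walk that defers to xas_retry() might look like the sketch
below.  ``count_present()`` is a made-up helper that merely counts entries
up to ``max``; this is kernel-only code::

    #include <linux/rcupdate.h>
    #include <linux/xarray.h>

    static unsigned long count_present(struct xarray *xa, unsigned long max)
    {
        XA_STATE(xas, xa, 0);
        unsigned long n = 0;
        void *entry;

        rcu_read_lock();
        xas_for_each(&xas, entry, max) {
            if (xas_retry(&xas, entry))
                continue;    /* internal entry: let the iterator retry */
            n++;
        }
        rcu_read_unlock();

        return n;
    }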

Additional functionality
------------------------

The xas_create_range() function allocates all the necessary memory
to store every entry in a range.  It will set ``ENOMEM`` in the xa_state if
it cannot allocate memory.

You can use xas_init_marks() to reset the marks on an entry
to their default state.  This is usually all marks clear, unless the
XArray is marked with ``XA_FLAGS_TRACK_FREE``, in which case mark 0 is set
and all other marks are clear.  Replacing one entry with another using
xas_store() will not reset the marks on that entry; if you want
the marks reset, you should do that explicitly.

The xas_load() function will walk the xa_state as close to the entry
as it can.  If you know the xa_state has already been walked to the
entry and need to check that the entry hasn't changed, you can use
xas_reload() to save a function call.

If you need to move to a different index in the XArray, call
xas_set().  This resets the cursor to the top of the tree, which
will generally make the next operation walk the cursor to the desired
spot in the tree.  If you want to move to the next or previous index,
call xas_next() or xas_prev().  Setting the index does
not walk the cursor around the array so does not require a lock to be
held, while moving to the next or previous index does.

You can search for the next present entry using xas_find().  This
is the equivalent of both xa_find() and xa_find_after();
if the cursor has been walked to an entry, then it will find the next
entry after the one currently referenced.  If not, it will return the
entry at the index of the xa_state.  Using xas_next_entry() to
move to the next present entry instead of xas_find() will save
a function call in the majority of cases at the expense of emitting more
inline code.

The xas_find_marked() function is similar.  If the xa_state has
not been walked, it will return the entry at the index of the xa_state,
if it is marked.  Otherwise, it will return the first marked entry after
the entry referenced by the xa_state.  The xas_next_marked()
function is the equivalent of xas_next_entry().

When iterating over a range of the XArray using xas_for_each()
or xas_for_each_marked(), it may be necessary to temporarily stop
the iteration.  The xas_pause() function exists for this purpose.
After you have done the necessary work and wish to resume, the xa_state
is in an appropriate state to continue the iteration after the entry
you last processed.  If you have interrupts disabled while iterating,
then it is good manners to pause the iteration and re-enable interrupts
every ``XA_CHECK_SCHED`` entries.
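A long walk which pauses periodically could be sketched as below.  The
``items_scan()`` helper is hypothetical, it holds the RCU read lock rather
than disabling interrupts, and it assumes it is called from process context
so that cond_resched() is allowed; this is kernel-only code::

    #include <linux/rcupdate.h>
    #include <linux/sched.h>
    #include <linux/xarray.h>

    static void items_scan(struct xarray *xa)
    {
        XA_STATE(xas, xa, 0);
        unsigned long seen = 0;
        void *entry;

        rcu_read_lock();
        xas_for_each(&xas, entry, ULONG_MAX) {
            if (xas_retry(&xas, entry))
                continue;

            /* ... process @entry here ... */

            if (++seen % XA_CHECK_SCHED)
                continue;

            /* Make the xa_state safe to use after the lock is dropped. */
            xas_pause(&xas);
            rcu_read_unlock();
            cond_resched();
            rcu_read_lock();
        }
        rcu_read_unlock();
    }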

The xas_get_mark(), xas_set_mark() and xas_clear_mark() functions require
the xa_state cursor to have been moved to the appropriate location in the
XArray; they will do nothing if you have called xas_pause() or xas_set()
immediately before.

You can call xas_set_update() to have a callback function
called each time the XArray updates a node.  This is used by the page
cache workingset code to maintain its list of nodes which contain only
shadow entries.

Multi-Index Entries
-------------------

The XArray has the ability to tie multiple indices together so that
operations on one index affect all indices.  For example, storing into
any index will change the value of the entry retrieved from any index.
Setting or clearing a mark on any index will set or clear the mark
on every index that is tied together.  The current implementation
only allows tying ranges which are aligned powers of two together;
e.g. indices 64-127 may be tied together, but 2-6 may not be.  This may
save substantial quantities of memory; for example tying 512 entries
together will save over 4kB.
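The "aligned power of two" rule can be checked with a few lines of
arithmetic.  The ``range_is_tieable()`` helper below is not part of the
XArray API; it is a userspace sketch which assumes that a range qualifies
when its length is a power of two and its first index is a multiple of that
length::

    #include <assert.h>
    #include <stdbool.h>

    /* True if [first, last] has a power-of-two length and is aligned to it. */
    static bool range_is_tieable(unsigned long first, unsigned long last)
    {
        unsigned long size;

        if (last < first)
            return false;
        size = last - first + 1;
        if (size == 0)
            return false;    /* length overflowed: not representable here */

        return (size & (size - 1)) == 0 && (first & (size - 1)) == 0;
    }

    static void range_demo(void)
    {
        assert(range_is_tieable(64, 127));    /* 64 indices, aligned to 64 */
        assert(!range_is_tieable(2, 6));      /* 5 indices: not a power of two */
        assert(!range_is_tieable(2, 5));      /* 4 indices, but 2 is not aligned */
    }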

You can create a multi-index entry by using XA_STATE_ORDER()
or xas_set_order() followed by a call to xas_store().
Calling xas_load() with a multi-index xa_state will walk the
xa_state to the right location in the tree, but the return value is not
meaningful, potentially being an internal entry or ``NULL`` even when there
is an entry stored within the range.  Calling xas_find_conflict()
will return the first entry within the range or ``NULL`` if there are no
entries in the range.  The xas_for_each_conflict() iterator will
iterate over every entry which overlaps the specified range.
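Storing a multi-index entry might be sketched as follows.  The
``item_store_order()`` wrapper is a hypothetical name, and the sketch
assumes a kernel built with multi-index support (``CONFIG_XARRAY_MULTI``)
and an ``index`` which is suitably aligned for ``order``::

    #include <linux/xarray.h>

    /* Store @item so that it covers 2^@order indices starting at @index. */
    static int item_store_order(struct xarray *xa, unsigned long index,
                                unsigned int order, void *item)
    {
        XA_STATE_ORDER(xas, xa, index, order);

        do {
            xas_lock(&xas);
            xas_store(&xas, item);
            xas_unlock(&xas);
        } while (xas_nomem(&xas, GFP_KERNEL));

        return xas_error(&xas);
    }

After a successful ``item_store_order(xa, 64, 6, item)``, loading any index
from 64 to 127 would be expected to return ``item``.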

If xas_load() encounters a multi-index entry, the xa_index
in the xa_state will not be changed.  When iterating over an XArray
or calling xas_find(), if the initial index is in the middle
of a multi-index entry, it will not be altered.  Subsequent calls
or iterations will move the index to the first index in the range.
Each entry will only be returned once, no matter how many indices it
occupies.

Using xas_next() or xas_prev() with a multi-index xa_state is not
supported.  Using either of these functions on a multi-index entry will
reveal sibling entries; these should be skipped over by the caller.

Storing ``NULL`` into any index of a multi-index entry will set the
entry at every index to ``NULL`` and dissolve the tie.  A multi-index
entry can be split into entries occupying smaller ranges either by calling
xas_split_alloc() without the xa_lock held, followed by taking the lock
and calling xas_split(), or by calling xas_try_split() with the xa_lock
held.  The difference between xas_split_alloc() + xas_split() and
xas_try_split() is that xas_split_alloc() + xas_split() split the entry
from the original order to the new order uniformly in one shot, whereas
xas_try_split() iteratively splits the entry containing the index
non-uniformly.  For example, to split an order-9 entry, which takes
2^(9-6)=8 slots assuming ``XA_CHUNK_SHIFT`` is 6, xas_split_alloc() +
xas_split() need 8 xa_nodes.  xas_try_split() splits the order-9 entry into
2 order-8 entries, then splits one order-8 entry, based on the given index,
into 2 order-7 entries, ..., and finally splits one order-1 entry into
2 order-0 entries.  When splitting the order-6 entry, a new xa_node is
needed, and xas_try_split() will try to allocate one if possible.  As a
result, xas_try_split() needs only 1 xa_node instead of 8.

Functions and structures
========================

.. kernel-doc:: include/linux/xarray.h
.. kernel-doc:: lib/xarray.c