Documentation/filesystems/path-lookup.rst

12 It has subsequently been updated to reflect changes in the kernel
22 exploration is needed to discover, is that it is complex.  There are
25 acquainted with such complexity and has tools to help manage it.  One
51 It is tempting to describe the second kind as starting with a
53 slashes and components, it can be empty, in other words.  This is
55 in Linux permit it when the ``AT_EMPTY_PATH`` flag is given.  For
57 can execute it by calling `execveat() <execveat_>`_ passing
62 it must identify a directory that already exists, otherwise an error
66 calls interpret it quite differently (e.g. some create it, some do
67 not), but it might not even exist: neither the empty pathname nor the
68 pathname that is just slashes has a final component.  If it does
69 exist, it could be "``.``" or "``..``" which are handled quite differently
74 If a pathname ends with a slash, such as "``/tmp/foo/``", it might be
91 checking that the trailing slash is not used where it isn't
92 permitted.  It also addresses the important issue of concurrent
119 that will be particularly relevant is that it is closely integrated
157 afraid of taking a lock when one is needed.  It uses a variety of
174 will behave as expected.  It also protects the ``->d_inode`` reference
184 setting ``d_inode`` to ``NULL``, or by removing it from the hash table
186 If the dentry is still in use the second option is used as it is
187 perfectly legal to keep using an open file after it has been deleted
190 ``d_inode`` be set to ``NULL``.  Doing it this way is more efficient for a
202 name (``d_name``) cannot be changed, and it cannot be removed from the
206 each candidate dentry that it finds in the hash table and then checks
207 that the parent and name are correct.  So it doesn't lock the parent
208 while searching in the cache; it only locks children.
212 but it first tries a more lightweight approach.  As seen in
229 it might end up continuing the search down the wrong chain,
233 from happening, but only to detect when it happens.
235 renamed.  If ``d_lookup`` finds that a rename happened while it
236 unsuccessfully scanned a chain in the hash table, it simply tries
251 cannot both happen at the same time.  It also keeps the directory
264 The semaphore affects pathname lookup in two distinct ways.  Firstly it
273 Secondly, when pathname lookup reaches the final component, it will
288 dentry, stores the required name and parent in it, checks if there
294 returned and the caller can know that it lost a race with some other
297 detect this from the presence of ``DCACHE_PAR_LOOKUP``.  In this case it
298 knows that it has won any race and now is responsible for asking the
300 the lookup is complete, it must call ``d_lookup_done()`` which clears
302 dentry from the secondary hash table - it will normally have been
309 ``d_alloc_parallel()`` has a little more work to do. It first waits for
312 will be woken by the call to ``d_lookup_done()``.  It then checks to see
313 if the dentry has now been added to the primary hash table.  If it
314 has, the dentry is returned and the caller just sees that it lost any
315 race.  If it hasn't been added to the primary hash table, the most
325 Per-CPU here means that incrementing the count is cheap as it only
327 it needs to check with every CPU.  Taking a ``mnt_count`` reference
331 in particular, doesn't stabilize the link to the mounted-on dentry.  It
333 and it provides a reference to the root dentry of the mounted
340 ``mount_lock`` is a global seqlock, a bit like ``rename_lock``.  It can be used to
373 In particular it is held while scanning chains in the dcache hash
376 Bringing it together with ``struct nameidata``
415 only assigned the first time it is used, or when a non-standard root
417 only one root is in effect for the entire path walk, even if it races
420 It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
432 escape that subtree.  It works a bit like a local ``chroot()``.
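
The "local ``chroot()``" behaviour can be illustrated with a toy, purely lexical model. This is only a sketch: it ignores symlinks, mount points and concurrent renames, which the real lookup code must handle. The names ``scoped_depth``, ``SCOPE_IN_ROOT`` and ``SCOPE_BENEATH`` are hypothetical and do not exist in the kernel. The model assumes that in-root scoping clamps "``..``" at the starting directory, much as "``/..``" is equivalent to "``/``", and that beneath scoping refuses the walk instead.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the two scoping behaviours; not kernel names. */
enum scope_mode { SCOPE_IN_ROOT, SCOPE_BENEATH };

/*
 * Toy lexical model of a scoped walk.  Returns how many directory
 * levels below the starting directory the walk ends, or -1 if the
 * SCOPE_BENEATH model would refuse the path.  Symlinks and mounts
 * are deliberately ignored.
 */
static int scoped_depth(const char *path, enum scope_mode mode)
{
    int depth = 0;
    const char *p = path;

    /* An absolute path is refused in the "beneath" model; in the
     * "in root" model the leading slash just means the starting dir. */
    if (*p == '/' && mode == SCOPE_BENEATH)
        return -1;

    while (*p) {
        size_t len;

        while (*p == '/')           /* skip runs of slashes */
            p++;
        len = strcspn(p, "/");      /* length of the next component */
        if (len == 0)
            break;                  /* trailing slashes only */

        if (len == 1 && p[0] == '.') {
            /* "." stays where it is */
        } else if (len == 2 && p[0] == '.' && p[1] == '.') {
            if (depth > 0)
                depth--;            /* ordinary step to the parent */
            else if (mode == SCOPE_BENEATH)
                return -1;          /* would escape: refuse */
            /* SCOPE_IN_ROOT: ".." at the root stays at the root */
        } else {
            depth++;                /* ordinary component */
        }
        p += len;
    }
    return depth;
}
```

In this model "``../../etc/passwd``" walked with ``SCOPE_IN_ROOT`` ends two levels below the starting directory, while the same path with ``SCOPE_BENEATH`` is refused.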
445 it calls ``handle_dots()`` which does the necessary locking as already
446 described.  If it finds a ``LAST_NORM`` component it first calls
448 filesystem to revalidate the result if it is that sort of filesystem.
449 If that doesn't get a good result, it calls "``lookup_slow()``" which
456 reference to the new ``vfsmount`` which is only counted if it is
457 different from the previous ``vfsmount``.  It then calls
470 ``nd->last_type`` to refer to the final component of the path.  It does
477 ``path_parentat()`` is clearly the simplest - it just wraps a little bit
485 ``path_lookupat()`` is nearly as simple - it is used when an existing
486 object is wanted such as by ``stat()`` or ``chmod()``.  It essentially just
491 not try to revalidate the mounted filesystem.  It effectively
497 Finally ``path_openat()`` is used for the ``open()`` system call; it
503 not always, take ``i_rwsem``, depending on what it finds.
512 the final component, it must be a trailing slash.
521 On filesystems that require it, the lookup routines will call the
524 from a server.  In some cases it may find that there has been a change
550 It can block to avoid races.  If an automount point is being
555 It can selectively allow only some processes to transit through a
556 mount point.  When a server process is managing automounts, it may
559 filesystem, which will then give it a special pass through
566 supports multiple filesystem namespaces, it is possible that the
583 communicate with server processes etc. but it should ultimately either
590 There is no new locking of import here and it is important that no
600 It is in many ways similar to REF-walk and the two share quite a bit
601 of code.  The significant difference in RCU-walk is how it allows for
606 refusing to handle a number of cases -- it instead falls back to
627 principle, but then it is really designed to work when there may well
631 parts of the filesystem tree, but in many parts it will be.  For the
632 other parts it is important that RCU-walk can quickly fall back to
636 as long as what it is looking for is in the cache and is stable.  It
638 and carefully watching where it is, to be sure it doesn't trip.  If it
640 isn't in the cache, then it tries to stop gracefully and switch to
646 This is an invariant that RCU-walk must guarantee.  It can only make
648 REF-walk could also have made if it were walking down the tree at the
651 RCU-walk finds it cannot stop gracefully, it simply gives up and
672 so it is very unlikely that there will be much, if any, benefit from
680 down a path.  The particular guarantee it provides is that the key
690 before taking references to the "next" dentry or vfsmount.  It also
694 Instead, it checks to see if a change has been made, and aborts or
695 retries if it has.
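
The "check for a change, then abort or retry" idea can be sketched in userspace with C11 atomics. This is an illustrative analogue, not the kernel's seqcount implementation: ``struct seq_pair`` and the ``seq_*`` helpers are hypothetical names, and the two integer fields merely stand in for data such as a dentry's parent and name. One simplification is assumed: a reader that sees an odd counter reports failure rather than waiting for the writer to finish.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace analogue of a sequence-counted record. */
struct seq_pair {
    atomic_uint seq;     /* even: stable, odd: a writer is mid-update */
    atomic_int parent;   /* stand-ins for the protected fields */
    atomic_int name;
};

static void seq_init(struct seq_pair *p)
{
    atomic_init(&p->seq, 0u);
    atomic_init(&p->parent, 0);
    atomic_init(&p->name, 0);
}

/* Sample the counter before reading the protected fields. */
static unsigned seq_read_begin(struct seq_pair *p)
{
    return atomic_load_explicit(&p->seq, memory_order_acquire);
}

/* True if the fields read since 'start' cannot be trusted. */
static bool seq_read_retry(struct seq_pair *p, unsigned start)
{
    atomic_thread_fence(memory_order_acquire);
    return (start & 1u) ||
           atomic_load_explicit(&p->seq, memory_order_relaxed) != start;
}

/* Writer: make the counter odd, update, then make it even again. */
static void seq_write(struct seq_pair *p, int parent, int name)
{
    atomic_fetch_add_explicit(&p->seq, 1u, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->parent, parent, memory_order_relaxed);
    atomic_store_explicit(&p->name, name, memory_order_relaxed);
    atomic_fetch_add_explicit(&p->seq, 1u, memory_order_release);
}

/*
 * One lockless read attempt.  Returns false if a writer interfered,
 * in which case the caller aborts or retries, taking no lock either way.
 */
static bool seq_try_read(struct seq_pair *p, int *parent, int *name)
{
    unsigned start = seq_read_begin(p);

    *parent = atomic_load_explicit(&p->parent, memory_order_relaxed);
    *name = atomic_load_explicit(&p->name, memory_order_relaxed);
    return !seq_read_retry(p, start);
}
```

The reader never blocks the writer: it only learns after the fact whether its snapshot was consistent, which is why the check has to be made before the values are acted upon.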
698 decisions that REF-walk could have made), it must make the checks at
710 is needed - which it usually is - RCU-walk must take a copy and then
723 instead.  Notably it does *not* use ``read_seqcount_retry()``, but
734 We already met the ``mount_lock`` seqlock when REF-walk used it to
736 it for that too, but for quite a bit more.
738 Instead of taking a counted reference to each ``vfsmount`` as it
744 relatively rare, it is reasonable to fall back on REF-walk any time
749 when the end of the path is reached.  It is also checked when stepping
751 ``follow_dotdot_rcu()``).  If it is ever found to have changed, the
755 If RCU-walk finds that ``mount_lock`` hasn't changed then it can be sure
772 the required pattern, though it does so for three different cases.
786 twice, once to determine if it is NULL and once to verify access
789 access and it is stored in the ``inode`` field of ``nameidata`` from where
790 it can be safely accessed without further validation.
793 ``lookup_slow()`` being too slow and requiring locks.  It is in
800 revalidates the new ``seq`` number.  It then validates the old ``dentry``
810 A semaphore is a fairly heavyweight lock that can only be taken when it is
815 dentry that it is looking for, or it will find a dentry which
816 ``read_seqretry()`` won't validate.  In either case it will drop down to
819 Though ``rename_lock`` could be used by RCU-walk as it doesn't require
824 something in the dentry cache, whether it is really there or not, it
841 It is also called from ``complete_walk()`` when the lookup has reached
849 will return ``-ECHILD`` which will percolate up until it triggers a new
852 For those cases where ``unlazy_walk()`` is an option, it essentially
853 takes a reference on each of the pointers that it holds (vfsmount,
856 it, too, aborts with ``-ECHILD``, otherwise the transition to REF-walk
861 already have one (often indirectly through another object), but it
863 all.  For ``dentry->d_lockref``, it is safe to increment the reference
864 counter to get a reference unless it has been explicitly marked as
868 For ``mnt->mnt_count`` it is safe to take a reference as long as
870 validation fails, it may *not* be safe to just drop that reference in
872 progressed too far.  So the code in ``legitimize_mnt()``, when it
873 finds that the reference it got might not be safe, checks the
875 correct, or if it should just decrement the count and pretend none of
884 file system might be included in RCU-walk, and it must know to be
890 In this case an extra "``MAY_NOT_BLOCK``" flag is passed so that it
891 knows not to sleep, but to return ``-ECHILD`` if it cannot complete
893 dentry, so it doesn't need to worry about further consistency checks.
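
The "do not sleep, return ``-ECHILD`` instead" contract can be sketched as a toy in plain C. Everything here is hypothetical: ``TOY_MAY_NOT_BLOCK``, ``struct toy_inode``, ``toy_permission()`` and ``toy_check()`` are illustrative stand-ins, not kernel interfaces, and the "slow path" only pretends to fetch the missing data. The sketch assumes that the caller reacts to ``-ECHILD`` by repeating the check in a mode where blocking is allowed.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's MAY_NOT_BLOCK flag. */
#define TOY_MAY_NOT_BLOCK 0x80

/* Hypothetical inode: permission data may or may not be cached. */
struct toy_inode {
    bool perm_cached;   /* can we decide without sleeping? */
    bool allowed;       /* the answer once the data is available */
};

/*
 * Toy permission check.  When asked not to block and the data is not
 * cached, it gives up with -ECHILD rather than sleeping.
 */
static int toy_permission(const struct toy_inode *inode, int mask)
{
    if (!inode->perm_cached && (mask & TOY_MAY_NOT_BLOCK))
        return -ECHILD;
    /* Without the flag a real filesystem could sleep here to fetch
     * the data; the toy simply proceeds to the answer. */
    return inode->allowed ? 0 : -EACCES;
}

/*
 * Caller side: try the non-blocking mode first and repeat in the
 * blocking mode only when told to.  *used_slow_path records which
 * mode produced the final answer.
 */
static int toy_check(const struct toy_inode *inode, bool *used_slow_path)
{
    int err = toy_permission(inode, TOY_MAY_NOT_BLOCK);

    *used_slow_path = false;
    if (err == -ECHILD) {
        *used_slow_path = true;
        err = toy_permission(inode, 0);
    }
    return err;
}
```

A definite answer such as ``-EACCES`` is returned directly in either mode; only ``-ECHILD`` means "ask again when sleeping is permitted".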
894 However if it accesses any other filesystem data structures, it must
904 ``seq`` number from the ``nameidata``, so it needs to be extra careful
907 result is not NULL before using it.  This pattern can be seen in
920 switch to REF-walk for the rest of the path.  We also saw it earlier
921 in ``dget_parent()`` when following a "``..``" link.  It tries a quick way
927 if anything goes wrong it is much safer to just abort and try a more
930 The emphasis here is "try quickly and check".  It should probably be
934 this whole process is assuming something is safe when in reality it
962 "``readlink -f``" command does, though it also edits out "``.``" and
987    Because it's a latency and DoS issue too. We need to react well to
988    true loops, but also to "very deep" non-loops. It's not about memory
989    use, it's about users triggering unreasonable CPU resources.
996 at most 40 symlinks in any one path lookup.  It previously imposed a
1003 symlinks.  In many cases this will be sufficient.  If it isn't, a
1008 It might seem that the name remnants are all that needs to be stored on
1017 to external storage.  It is particularly important for RCU-walk to be
1019 it doesn't need to drop down into REF-walk.
1026 inode`` it typically allocates extra space to store private data (a
1035 construct the symlink content into that memory whenever it is needed.
1037 When the symlink is stored in the inode, it has the same lifetime as
1043 symlink is stored and it can be accessed directly whenever needed.
1051 significantly, needs to release that reference when it is finished
1052 with it.
1055 mode.  It does require making changes to memory, which is best avoided,
1056 but that isn't necessarily a big cost and it is better than dropping
1060 filesystem cannot successfully get a reference in RCU-walk mode, it
1066 RCU-walk mode as the rewrite is not quite complete.  It is likely that
1068 called in RCU-walk mode so it both (1) knows to be careful, and (2) has the
1071 all the data structures it references are safe to be accessed while
1077 complexity.  It requires a reference to the inode so that the
1087 provides an opaque "cookie" that must be passed to ``->put_link()`` so that it
1090 completely.  Only the filesystem knows what it is.
1104 with 40 entries it adds up to 1600 bytes total, which is less than
1105 half a page.  So it might seem like a lot, but is by no means
1109 part of the symlink that the other fields refer to.  It is the remnant
1126 called; it then gets the link from the filesystem.  Providing that
1139 It is most convenient to push the new symlink references onto the
1142 old symlink as it walks that last component.  So it is quite
1145 new symlink.  It is guided in this by two flags: ``WALK_GET``, which
1146 gives it permission to follow a symlink if it finds one, and
1147 ``WALK_PUT``, which tells it to release the current symlink after it has been
1174 something that looks like a symlink.  It is really a reference to the
1175 target file, not just the name of it.  When you ``readlink`` these
1176 objects you get a name that might refer to the same file - unless it
1189 following all symbolic links it finds, until it reaches the final
1192 ``last`` name if it doesn't exist or give an error if it does.  Other
1209 report that it is a symlink are ``lookup_last()``, ``mountpoint_last()``
1214 Of these, ``do_last()`` is the most interesting as it is used for
1223    it.  If the file was found in the dcache, then ``vfs_open()`` is used for
1225    the filesystem provides it) to combine the final lookup with the open, or
1252 We previously said of RCU-walk that it would "take no locks, increment
1264 Symlinks are different, it seems.  Both reading a symlink (with ``readlink()``)
1270 It is not clear why this is the case; POSIX has little to say on the
1284 quite complex.  Trying to stay in RCU-walk while doing it is best
1285 avoided.  Fortunately it is often permitted to skip the ``atime``
1293 It is easy to test if an ``atime`` update is needed while in RCU-walk
1294 mode and, if it isn't, the update can be skipped and RCU-walk mode
1310 very early on.  If it is set, empty pathnames are not considered to be
1325 provided by the caller, so it shouldn't be released when it is no
1329 it had the right name but for some other reason.  This happens when
1363 as well as blocking ".." if it would jump outside the starting point.
1381 point, then the mount is triggered.  Some operations would trigger it
1384 it sets ``LOOKUP_AUTOMOUNT``, as does "``quotactl()``" and the handling of
1388 symlinks.  Some system calls set or clear it implicitly, while
1390 ``UMOUNT_NOFOLLOW`` to control it.  Its effect is similar to
1391 ``WALK_GET`` that we already met, but it is used in a different way.
1394 Various callers set this and it is also set when the final component
1401 if it knows that it will be asked to open or create the file soon.
1410 than even a couple of releases ago.  But that doesn't mean it is
1412 symlinks that are stored in the inode so, while it handles many ext4
1413 symlinks, it doesn't help with NFS, XFS, or Btrfs.  That support