.. SPDX-License-Identifier: GPL-2.0

========
ORANGEFS
========

OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal
for large storage problems faced by HPC, BigData, Streaming Video,
Genomics, and Bioinformatics.

OrangeFS, originally called PVFS, was first developed in 1993 by
Walt Ligon and Eric Blumer as a parallel file system for Parallel
Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns
of parallel programs.

OrangeFS features include:

  * Distributes file data among multiple file servers
  * Supports simultaneous access by multiple clients
  * Stores file data and metadata on servers using local file system
    and access methods
  * Userspace implementation is easy to install and maintain
  * Direct MPI support
  * Stateless


Mailing List Archives
=====================

http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/


Mailing List Submissions
========================

devel@lists.orangefs.org


Documentation
=============

http://www.orangefs.org/documentation/

Running ORANGEFS On a Single Server
===================================

OrangeFS is usually run in large installations with multiple servers and
clients, but a complete filesystem can be run on a single machine for
development and testing.

On Fedora, install orangefs and orangefs-server::

    dnf -y install orangefs orangefs-server

There is an example server configuration file in
/etc/orangefs/orangefs.conf.  Change localhost to your hostname if
necessary.

To generate a filesystem to run xfstests against, see below.

There is an example client configuration file in /etc/pvfs2tab.  It is a
single line.  Uncomment it and change the hostname if necessary.  This
controls clients which use libpvfs2.  This does not control the
pvfs2-client-core.

Create the filesystem::

    pvfs2-server -f /etc/orangefs/orangefs.conf

Start the server::

    systemctl start orangefs-server

Test the server::

    pvfs2-ping -m /pvfsmnt

Start the client.  The module must be compiled in or loaded before this
point::

    systemctl start orangefs-client

Mount the filesystem::

    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt

Userspace Filesystem Source
===========================

http://www.orangefs.org/download

OrangeFS versions prior to 2.9.3 are not compatible with the
upstream version of the kernel client.


Building ORANGEFS on a Single Server
====================================

Where OrangeFS cannot be installed from distribution packages, it may be
built from source.

You can omit --prefix if you don't care that things are sprinkled around
in /usr/local.  As of version 2.9.6, OrangeFS uses Berkeley DB by
default; we will probably be changing the default to LMDB soon.

::

    ./configure --prefix=/opt/ofs --with-db-backend=lmdb --disable-usrint

    make

    make install

Create an orangefs config file by running pvfs2-genconfig and
specifying a target config file. pvfs2-genconfig prompts you through
its questions. The defaults are generally fine, but you should use
your server's hostname rather than "localhost" when it comes to
that question::

    /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf

Create an /etc/pvfs2tab file (localhost is fine)::

    echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
        /etc/pvfs2tab

Create the mount point you specified in the tab file if needed::

    mkdir /pvfsmnt

Bootstrap the server::

    /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf

Start the server::

    /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf

Now the server should be running. Running pvfs2-ls is a simple
test to verify that the server is running::

    /opt/ofs/bin/pvfs2-ls /pvfsmnt

If everything seems to be working, load the kernel module and
turn on the client core::

    /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core

Mount your filesystem::

    mount -t pvfs2 tcp://`hostname`:3334/orangefs /pvfsmnt


Running xfstests
================

It is useful to use a scratch filesystem with xfstests.  This can be
done with only one server.

Make a second copy of the FileSystem section in the server configuration
file, which is /etc/orangefs/orangefs.conf.  Change the Name to scratch.
Change the ID to something other than the ID of the first FileSystem
section (2 is usually a good choice).

Then there are two FileSystem sections: orangefs and scratch.

This change should be made before creating the filesystem.

::

    pvfs2-server -f /etc/orangefs/orangefs.conf

To run xfstests, create /etc/xfsqa.config::

    TEST_DIR=/orangefs
    TEST_DEV=tcp://localhost:3334/orangefs
    SCRATCH_MNT=/scratch
    SCRATCH_DEV=tcp://localhost:3334/scratch

Then xfstests can be run::

    ./check -pvfs2


Options
=======

The following mount options are accepted:

  acl
    Allow the use of Access Control Lists on files and directories.

  intr
    Some operations between the kernel client and the user space
    filesystem can be interruptible, such as changes in debug levels
    and the setting of tunable parameters.

  local_lock
    Enable POSIX locking from the perspective of "this" kernel. The
    default file_operations lock action is to return ENOSYS. POSIX
    locking kicks in if the filesystem is mounted with -o local_lock.
    Distributed locking is being worked on for the future.


Debugging
=========

If you want the debug (GOSSIP) statements in a particular
source file (inode.c for example) to go to syslog::

  echo inode > /sys/kernel/debug/orangefs/kernel-debug

No debugging (the default)::

  echo none > /sys/kernel/debug/orangefs/kernel-debug

Debugging from several source files::

  echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug

All debugging::

  echo all > /sys/kernel/debug/orangefs/kernel-debug

Get a list of all debugging keywords::

  cat /sys/kernel/debug/orangefs/debug-help


Protocol between Kernel Module and Userspace
============================================

OrangeFS is a user space filesystem and an associated kernel module.
We'll just refer to the user space part of OrangeFS as "userspace"
from here on out. OrangeFS descends from PVFS, and userspace code
still uses PVFS for function and variable names. Userspace typedefs
many of the important structures. Function and variable names in
the kernel module have been transitioned to "orangefs", and the Linux
kernel coding style avoids typedefs, so kernel module structures that
correspond to userspace structures are not typedefed.

The kernel module implements a pseudo device that userspace
can read from and write to. Userspace can also manipulate the
kernel module through the pseudo device with ioctl.

The Bufmap
----------

At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers: one is used for IO and one is used for readdir
operations. The IO buffer is 41943040 bytes and the readdir buffer is
4194304 bytes. Each buffer contains logical chunks, or partitions, and
a pointer to each buffer is added to its own PVFS_dev_map_desc structure
which also describes its total size, as well as the size and number of
the partitions.
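
The allocation described above can be sketched in plain C roughly as
follows. This is only an illustration: ``struct map_desc`` is a simplified,
hypothetical stand-in for PVFS_dev_map_desc, the helper names are made up,
and only the IO buffer's partition size and count are taken from this
document.

.. code-block:: c

    #ifndef _POSIX_C_SOURCE
    #define _POSIX_C_SOURCE 200112L /* for posix_memalign() */
    #endif
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical stand-in for PVFS_dev_map_desc. */
    struct map_desc {
        void *ptr;          /* start of the buffer */
        int32_t total_size; /* size * count */
        int32_t size;       /* size of one partition */
        int32_t count;      /* number of partitions */
    };

    #define IO_DESC_SIZE  (4 * 1024 * 1024) /* 4194304 bytes per partition */
    #define IO_DESC_COUNT 10                /* 10 partitions */

    /* Allocate a page-aligned buffer of count partitions; 0 on success. */
    static int map_desc_alloc(struct map_desc *d, int32_t size, int32_t count)
    {
        long page = sysconf(_SC_PAGESIZE);
        void *p = NULL;

        if (page <= 0 || size <= 0 || count <= 0)
            return -1;
        if (posix_memalign(&p, (size_t)page, (size_t)size * (size_t)count))
            return -1;
        d->ptr = p;
        d->size = size;
        d->count = count;
        d->total_size = size * count;
        return 0;
    }

    /* Pin the buffer in RAM; this can fail if RLIMIT_MEMLOCK is too low. */
    static int map_desc_lock(const struct map_desc *d)
    {
        return mlock(d->ptr, (size_t)d->total_size);
    }

Calling ``map_desc_alloc(&d, IO_DESC_SIZE, IO_DESC_COUNT)`` and then
``map_desc_lock(&d)`` yields a descriptor whose ``total_size`` is 41943040,
matching the IO buffer size given above.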

A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
mapping routine in the kernel module with an ioctl. The structure is
copied from user space to kernel space with copy_from_user and is used
to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
then contains:

  * refcnt - a reference counter
  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
    partition size, which represents the filesystem's block size and
    is used for s_blocksize in super blocks.
  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
    partitions in the IO buffer.
  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
  * total_size - the total size of the IO buffer.
  * page_count - the number of 4096 byte pages in the IO buffer.
  * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
    of kcalloced memory. This memory is used as an array of pointers
    to each of the pages in the IO buffer through a call to get_user_pages.
  * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
    bytes of kcalloced memory. This memory is further initialized:

      user_desc is the kernel's copy of the IO buffer's PVFS_dev_map_desc
      structure (struct ORANGEFS_dev_map_desc in the kernel module).
      user_desc->ptr points to the IO buffer.

      ::

        pages_per_desc = bufmap->desc_size / PAGE_SIZE
        offset = 0

        bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[0].array_count = pages_per_desc = 1024
        bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
        offset += 1024
                           .
                           .
                           .
        bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[9].array_count = pages_per_desc = 1024
        bufmap->desc_array[9].uaddr = (user_desc->ptr) +
                                               (9 * 1024 * 4096)
        offset += 1024

  * buffer_index_array - a desc_count sized array of ints, used to
    indicate which of the IO buffer's partitions are available to use.
  * buffer_index_lock - a spinlock to protect buffer_index_array during update.
  * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
    int array used to indicate which of the readdir buffer's partitions are
    available to use.
  * readdir_index_lock - a spinlock to protect readdir_index_array during
    update.
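
The arithmetic behind desc_shift and the desc_array walk above can be
checked with a small userspace sketch. ``struct desc_layout`` and both
helpers are hypothetical names used only for illustration; the real
orangefs_bufmap_desc holds page pointers rather than indexes, and the
4096 byte page size is the one assumed in the list above.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    #define SKETCH_PAGE_SIZE 4096u /* page size assumed in the text */

    /* Hypothetical, trimmed-down view of one IO buffer partition. */
    struct desc_layout {
        size_t first_page; /* index of its first entry in page_array */
        size_t page_count; /* pages_per_desc */
        uintptr_t uaddr;   /* userspace address of the partition */
    };

    /* log2 of a power-of-two size, for example 4194304 gives 22. */
    static int size_to_shift(uint32_t size)
    {
        int shift = 0;

        while (size > 1) {
            size >>= 1;
            shift++;
        }
        return shift;
    }

    /* Fill out[0..desc_count-1] the way the listing walks desc_array. */
    static void layout_descs(struct desc_layout *out, int desc_count,
                             uint32_t desc_size, uintptr_t user_ptr)
    {
        size_t pages_per_desc = desc_size / SKETCH_PAGE_SIZE;
        size_t offset = 0;
        int i;

        for (i = 0; i < desc_count; i++) {
            out[i].first_page = offset;
            out[i].page_count = pages_per_desc;
            out[i].uaddr = user_ptr +
                (uintptr_t)i * pages_per_desc * SKETCH_PAGE_SIZE;
            offset += pages_per_desc;
        }
    }

With desc_size 4194304 and desc_count 10, each partition covers 1024 pages,
the last partition starts at page 9216, and the page array as a whole holds
10240 entries.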

Operations
----------

The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
needs to communicate with userspace. Part of the op contains the "upcall"
which expresses the request to userspace. Part of the op eventually
contains the "downcall" which expresses the results of the request.

The slab allocator is used to keep a cache of op structures handy.

At init time the kernel module defines and initializes a request list
and an in_progress hash table to keep track of all the ops that are
in flight at any given time.

Ops are stateful:

 * unknown - op was just initialized
 * waiting - op is on request_list (upward bound)
 * inprogr - op is in progress (waiting for downcall)
 * serviced - op has matching downcall; ok
 * purged - op has to start a timer since client-core
   exited uncleanly before servicing op
 * given up - submitter has given up waiting for it

When some arbitrary userspace program needs to perform a
filesystem operation on OrangeFS (readdir, I/O, create, and so on)
an op structure is initialized and tagged with a distinguishing ID
number. The upcall part of the op is filled out, and the op is
passed to the "service_operation" function.

The service_operation function changes the op's state to "waiting", puts
it on the request list, and signals the OrangeFS file_operations.poll
function through a wait queue. Userspace is polling the pseudo-device
and thus becomes aware of the upcall request that needs to be read.

When the OrangeFS file_operations.read function is triggered, the
request list is searched for an op that seems ready-to-process.
The op is removed from the request list. The tag from the op and
the filled-out upcall struct are copied to userspace with copy_to_user.

If any of these (and some additional protocol) copy_to_user calls fail,
the op's state is set to "waiting" and the op is added back to
the request list. Otherwise, the op's state is changed to "in progress",
and the op is hashed on its tag and put onto the end of a list in the
in_progress hash table at the index the tag hashed to.

When userspace has assembled the response to the upcall, it
writes the response, which includes the distinguishing tag, back to
the pseudo device in a series of io_vecs. This triggers the OrangeFS
file_operations.write_iter function to find the op with the associated
tag and remove it from the in_progress hash table. As long as the op's
state is not "canceled" or "given up", its state is set to "serviced".
The file_operations.write_iter function returns to the waiting vfs,
and back to service_operation through wait_for_matching_downcall.

The service_operation function then returns to its caller with the op's
downcall part (the response to the upcall) filled out.
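
The round trip described above can be modeled in userspace C as a rough
sketch. Everything here is a simplification made for illustration: the
struct, the function names, the modulo hash and the table size are
assumptions, and locking, wait queues, the "purged" state and the
copy_to_user steps are left out.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    enum op_state { OP_UNKNOWN, OP_WAITING, OP_INPROGR, OP_SERVICED,
                    OP_GIVEN_UP };

    #define HASH_SIZE 16 /* illustrative; not the kernel's table size */

    struct op {
        uint64_t tag;
        enum op_state state;
        struct op *next;
    };

    static struct op *request_list;           /* ops waiting to be read */
    static struct op *in_progress[HASH_SIZE]; /* ops awaiting a downcall */

    /* service_operation step: mark the op waiting and queue it (FIFO). */
    static void queue_op(struct op *op)
    {
        struct op **pp = &request_list;

        op->state = OP_WAITING;
        op->next = NULL;
        while (*pp)
            pp = &(*pp)->next;
        *pp = op;
    }

    /* read step: take the next op off the request list, hash it on its tag. */
    static struct op *read_next_op(void)
    {
        struct op *op = request_list;
        struct op **pp;

        if (!op)
            return NULL;
        request_list = op->next;
        op->state = OP_INPROGR;
        op->next = NULL;
        pp = &in_progress[op->tag % HASH_SIZE];
        while (*pp)
            pp = &(*pp)->next;
        *pp = op; /* appended to the end of the bucket's list */
        return op;
    }

    /* write step: find the op by tag, unhash it and mark it serviced. */
    static struct op *complete_op(uint64_t tag)
    {
        struct op **pp = &in_progress[tag % HASH_SIZE];
        struct op *op;

        for (; *pp; pp = &(*pp)->next) {
            if ((*pp)->tag == tag) {
                op = *pp;
                *pp = op->next;
                op->next = NULL;
                if (op->state != OP_GIVEN_UP)
                    op->state = OP_SERVICED;
                return op;
            }
        }
        return NULL;
    }

An op therefore moves from ``queue_op()`` (waiting) through
``read_next_op()`` (in progress) to ``complete_op()`` (serviced), and a
response carrying an unknown tag finds no op to complete.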

The "client-core" is the bridge between the kernel module and
userspace. The client-core is a daemon. The client-core has an
associated watchdog daemon. If the client-core is ever signaled
to die, the watchdog daemon restarts the client-core. Even though
the client-core is restarted "right away", there is a period of
time during such an event that the client-core is dead. A dead client-core
can't be triggered by the OrangeFS file_operations.poll function.
Ops that pass through service_operation during a "dead spell" can time out
on the wait queue and one attempt is made to recycle them. Obviously,
if the client-core stays dead too long, the arbitrary userspace processes
trying to use OrangeFS will be negatively affected. Waiting ops
that can't be serviced will be removed from the request list and
have their states set to "given up". In-progress ops that can't
be serviced will be removed from the in_progress hash table and
have their states set to "given up".

Readdir and I/O ops are atypical with respect to their payloads.

  - readdir ops use the smaller of the two pre-allocated pre-partitioned
    memory buffers. The readdir buffer is only available to userspace.
    The kernel module obtains an index to a free partition before launching
    a readdir op. Userspace deposits the results into the indexed partition
    and then writes them back to the pvfs device.

  - io (read and write) ops use the larger of the two pre-allocated
    pre-partitioned memory buffers. The IO buffer is accessible from
    both userspace and the kernel module. The kernel module obtains an
    index to a free partition before launching an io op. The kernel module
    deposits write data into the indexed partition, to be consumed
    directly by userspace. Userspace deposits the results of read
    requests into the indexed partition, to be consumed directly
    by the kernel module.

Responses to kernel requests are all packaged in pvfs2_downcall_t
structs. Besides a few other members, pvfs2_downcall_t contains a
union of structs, each of which is associated with a particular
response type.

The several members outside of the union are:

 ``int32_t type``
    - type of operation.
 ``int32_t status``
    - return code for the operation.
 ``int64_t trailer_size``
    - 0 unless readdir operation.
 ``char *trailer_buf``
    - initialized to NULL, used during readdir operations.

The appropriate member inside the union is filled out for any
particular response.

  PVFS2_VFS_OP_FILE_IO
    fill a pvfs2_io_response_t

  PVFS2_VFS_OP_LOOKUP
    fill a PVFS_object_kref

  PVFS2_VFS_OP_CREATE
    fill a PVFS_object_kref

  PVFS2_VFS_OP_SYMLINK
    fill a PVFS_object_kref

  PVFS2_VFS_OP_GETATTR
    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
    and, when the object is a symlink, fill in a string with the link
    target.

  PVFS2_VFS_OP_MKDIR
    fill a PVFS_object_kref

  PVFS2_VFS_OP_STATFS
    fill a pvfs2_statfs_response_t with useless info. It is hard for
    us to know, in a timely fashion, these statistics about our
    distributed network filesystem.

  PVFS2_VFS_OP_FS_MOUNT
    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
    except its members are in a different order and "__pad1" is replaced
    with "id".

  PVFS2_VFS_OP_GETXATTR
    fill a pvfs2_getxattr_response_t

  PVFS2_VFS_OP_LISTXATTR
    fill a pvfs2_listxattr_response_t

  PVFS2_VFS_OP_PARAM
    fill a pvfs2_param_response_t

  PVFS2_VFS_OP_PERF_COUNT
    fill a pvfs2_perf_count_response_t

  PVFS2_VFS_OP_FSKEY
    fill a pvfs2_fs_key_response_t

  PVFS2_VFS_OP_READDIR
    pack everything needed to represent a pvfs2_readdir_response_t into
    the readdir buffer descriptor specified in the upcall.

Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
made by the kernel side.

A buffer_list containing:

  - a pointer to the prepared response to the request from the
    kernel (struct pvfs2_downcall_t).
  - and also, in the case of a readdir request, a pointer to a
    buffer containing descriptors for the objects in the target
    directory.

... is sent to the function (PINT_dev_write_list) which performs
the writev.

PINT_dev_write_list has a local iovec array: struct iovec io_array[10];

The first four elements of io_array are initialized like this for all
responses::

  io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
  io_array[0].iov_len = sizeof(int32_t)

  io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
  io_array[1].iov_len = sizeof(int32_t)

  io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
  io_array[2].iov_len = sizeof(int64_t)

  io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
                         of global variable vfs_request (vfs_request_t)
  io_array[3].iov_len = sizeof(pvfs2_downcall_t)

Readdir responses initialize the fifth element of io_array like this::

  io_array[4].iov_base = contents of member trailer_buf (char *)
                         from out_downcall member of global variable
                         vfs_request
  io_array[4].iov_len = contents of member trailer_size (PVFS_size)
                        from out_downcall member of global variable
                        vfs_request
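
The gather write described above might look roughly like the following
sketch. It is not the code of PINT_dev_write_list: ``struct
sketch_downcall`` is a hypothetical stand-in that carries only the members
listed earlier, without the union, and ``write_response`` with its
parameters is a made-up name.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Hypothetical stand-in; the real pvfs2_downcall_t has more members. */
    struct sketch_downcall {
        int32_t type;
        int32_t status;
        int64_t trailer_size;
        char *trailer_buf;
    };

    /*
     * Gather one response and hand it to fd with a single writev().
     * Returns the number of bytes written, or -1 on error.
     */
    static ssize_t write_response(int fd, int32_t proto_ver, int32_t magic,
                                  uint64_t tag, struct sketch_downcall *dc)
    {
        struct iovec io_array[10];
        int n = 0;

        /* The first four elements are present in every response. */
        io_array[n].iov_base = &proto_ver;
        io_array[n++].iov_len = sizeof(proto_ver);
        io_array[n].iov_base = &magic;
        io_array[n++].iov_len = sizeof(magic);
        io_array[n].iov_base = &tag;
        io_array[n++].iov_len = sizeof(tag);
        io_array[n].iov_base = dc;
        io_array[n++].iov_len = sizeof(*dc);

        /* Readdir responses add the trailer as a fifth element. */
        if (dc->trailer_size > 0 && dc->trailer_buf) {
            io_array[n].iov_base = dc->trailer_buf;
            io_array[n++].iov_len = (size_t)dc->trailer_size;
        }
        return writev(fd, io_array, n);
    }

Using one writev() call keeps the header fields, the downcall and the
optional trailer together as a single write to the device.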

OrangeFS exploits the dcache in order to avoid sending redundant
requests to userspace. We keep object inode attributes up-to-date with
orangefs_inode_getattr, which uses two arguments to help it decide
whether or not to update an inode: "new" and "bypass".
OrangeFS keeps private data in an object's inode that includes a short
timeout value, getattr_time, which allows any invocation of
orangefs_inode_getattr to know how long it has been since the inode was
updated. When the object is not new (new == 0) and the bypass flag is not
set (bypass == 0) orangefs_inode_getattr returns without updating the inode
if getattr_time has not timed out. Getattr_time is updated each time the
inode is updated.
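
The decision described above can be summarized in a short sketch. The
function and parameter names are hypothetical, and it assumes that the
stored value is the tick at which the cached attributes stop being trusted;
the actual logic lives in orangefs_inode_getattr.

.. code-block:: c

    #include <stdbool.h>

    /* Wrap-safe "a is before b" for a free-running tick counter. */
    static bool tick_before(unsigned long a, unsigned long b)
    {
        return (long)(a - b) < 0;
    }

    /*
     * Return true when the cached attributes may be reused, that is when
     * the object is not new, no bypass was requested, and the timeout
     * (assumed to be stored as an expiry tick) has not been reached.
     */
    static bool can_skip_getattr(bool is_new, bool bypass,
                                 unsigned long now,
                                 unsigned long getattr_expiry)
    {
        return !is_new && !bypass && tick_before(now, getattr_expiry);
    }

For example, ``can_skip_getattr(false, false, 100, 150)`` is true, while
setting either flag or letting ``now`` pass the expiry tick forces a refresh
from userspace.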

Creation of a new object (file, dir, symlink) includes the evaluation of
its pathname, resulting in a negative directory entry for the object.
A new inode is allocated and associated with the dentry, turning it from
a negative dentry into a "productive full member of society". OrangeFS
obtains the new inode from Linux with new_inode() and associates
the inode with the dentry by sending the pair back to Linux with
d_instantiate().

The evaluation of a pathname for an object resolves to its corresponding
dentry. If there is no corresponding dentry, one is created for it in
the dcache. Whenever a dentry is modified or verified OrangeFS stores a
short timeout value in the dentry's d_time, and the dentry will be trusted
for that amount of time. OrangeFS is a network filesystem, and objects
can potentially change out-of-band with any particular OrangeFS kernel module
instance, so trusting a dentry is risky. The alternative to trusting
dentries is to always obtain the needed information from userspace - at
least a trip to the client-core, maybe to the servers. Obtaining information
from a dentry is cheap, obtaining it from userspace is relatively expensive,
hence the motivation to use the dentry when possible.

The timeout values d_time and getattr_time are jiffy based, and the
code is designed to avoid the jiffy-wrap problem::

    "In general, if the clock may have wrapped around more than once, there
    is no way to tell how much time has elapsed. However, if the times t1
    and t2 are known to be fairly close, we can reliably compute the
    difference in a way that takes into account the possibility that the
    clock may have wrapped between times."

from course notes by instructor Andy Wang
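
The technique in the quote can be illustrated with plain unsigned
arithmetic. The helper names below are hypothetical; in kernel code the
time_after() and time_before() macros express the same idea for jiffies.

.. code-block:: c

    #include <limits.h>
    #include <stdbool.h>

    /* Ticks elapsed from t1 to t2; correct across one counter wrap. */
    static unsigned long ticks_elapsed(unsigned long t1, unsigned long t2)
    {
        return t2 - t1; /* unsigned arithmetic is modulo 2^N */
    }

    /*
     * True if t2 is later than t1, provided the two values are less than
     * half the counter range apart.
     */
    static bool tick_after(unsigned long t2, unsigned long t1)
    {
        return (long)(t1 - t2) < 0;
    }

With ``t1 = ULONG_MAX - 5`` and ``t2 = 10`` the counter has wrapped between
the two samples: a naive ``t2 > t1`` comparison is false, yet
``tick_after(t2, t1)`` is true and ``ticks_elapsed(t1, t2)`` is 16.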