.. SPDX-License-Identifier: GPL-2.0

========
ORANGEFS
========

OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal
for large storage problems faced by HPC, BigData, Streaming Video,
Genomics, Bioinformatics.

Orangefs, originally called PVFS, was first developed in 1993 by
Walt Ligon and Eric Blumer as a parallel file system for Parallel
Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns
of parallel programs.

Orangefs features include:

  * Distributes file data among multiple file servers
  * Supports simultaneous access by multiple clients
  * Stores file data and metadata on servers using local file system
    and access methods
  * Userspace implementation is easy to install and maintain
  * Direct MPI support
  * Stateless


Mailing List Archives
=====================

http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/


Mailing List Submissions
========================

devel@lists.orangefs.org


Documentation
=============

http://www.orangefs.org/documentation/

Running ORANGEFS On a Single Server
===================================

OrangeFS is usually run in large installations with multiple servers and
clients, but a complete filesystem can be run on a single machine for
development and testing.

On Fedora, install orangefs and orangefs-server::

    dnf -y install orangefs orangefs-server

There is an example server configuration file in
/etc/orangefs/orangefs.conf. Change localhost to your hostname if
necessary.

To generate a filesystem to run xfstests against, see below.

There is an example client configuration file in /etc/pvfs2tab. It is a
single line. Uncomment it and change the hostname if necessary. This
controls clients which use libpvfs2. This does not control the
pvfs2-client-core.

Create the filesystem::

    pvfs2-server -f /etc/orangefs/orangefs.conf

Start the server::

    systemctl start orangefs-server

Test the server::

    pvfs2-ping -m /pvfsmnt

Start the client. The module must be compiled in or loaded before this
point::

    systemctl start orangefs-client

Mount the filesystem::

    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt

Userspace Filesystem Source
===========================

http://www.orangefs.org/download

Orangefs versions prior to 2.9.3 are not compatible with the
upstream version of the kernel client.


Building ORANGEFS on a Single Server
====================================

Where OrangeFS cannot be installed from distribution packages, it may be
built from source.

You can omit --prefix if you don't care that things are sprinkled around
in /usr/local.
As of version 2.9.6, OrangeFS uses Berkeley DB by
default; the default will probably change to LMDB soon.

::

    ./configure --prefix=/opt/ofs --with-db-backend=lmdb --disable-usrint

    make

    make install

Create an orangefs config file by running pvfs2-genconfig and
specifying a target config file. Pvfs2-genconfig will prompt you
through. Generally it works fine to take the defaults, but you
should use your server's hostname, rather than "localhost", when
it comes to that question::

    /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf

Create an /etc/pvfs2tab file (localhost is fine)::

    echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
        /etc/pvfs2tab

Create the mount point you specified in the tab file if needed::

    mkdir /pvfsmnt

Bootstrap the server::

    /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf

Start the server::

    /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf

Now the server should be running.
Pvfs2-ls is a simple
test to verify that the server is running::

    /opt/ofs/bin/pvfs2-ls /pvfsmnt

If stuff seems to be working, load the kernel module and
turn on the client core::

    /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core

Mount your filesystem::

    mount -t pvfs2 tcp://`hostname`:3334/orangefs /pvfsmnt


Running xfstests
================

It is useful to use a scratch filesystem with xfstests. This can be
done with only one server.

Make a second copy of the FileSystem section in the server configuration
file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch.
Change the ID to something other than the ID of the first FileSystem
section (2 is usually a good choice).

Then there are two FileSystem sections: orangefs and scratch.

This change should be made before creating the filesystem.

::

    pvfs2-server -f /etc/orangefs/orangefs.conf

To run xfstests, create /etc/xfsqa.config::

    TEST_DIR=/orangefs
    TEST_DEV=tcp://localhost:3334/orangefs
    SCRATCH_MNT=/scratch
    SCRATCH_DEV=tcp://localhost:3334/scratch

Then xfstests can be run::

    ./check -pvfs2


Options
=======

The following mount options are accepted:

  acl
    Allow the use of Access Control Lists on files and directories.

  intr
    Some operations between the kernel client and the user space
    filesystem can be interruptible, such as changes in debug levels
    and the setting of tunable parameters.

  local_lock
    Enable posix locking from the perspective of "this" kernel. The
    default file_operations lock action is to return ENOSYS. Posix
    locking kicks in if the filesystem is mounted with -o local_lock.
    Distributed locking is being worked on for the future.


Debugging
=========

To send the debug (GOSSIP) statements from a particular
source file (inode.c for example) to syslog::

    echo inode > /sys/kernel/debug/orangefs/kernel-debug

No debugging (the default)::

    echo none > /sys/kernel/debug/orangefs/kernel-debug

Debugging from several source files::

    echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug

All debugging::

    echo all > /sys/kernel/debug/orangefs/kernel-debug

Get a list of all debugging keywords::

    cat /sys/kernel/debug/orangefs/debug-help


Protocol between Kernel Module and Userspace
============================================

Orangefs is a user space filesystem and an associated kernel module.
We'll just refer to the user space part of Orangefs as "userspace"
from here on out. Orangefs descends from PVFS, and userspace code
still uses PVFS for function and variable names. Userspace typedefs
many of the important structures.
Function and variable names in
the kernel module have been transitioned to "orangefs", and the Linux
Coding Style avoids typedefs, so kernel module structures that
correspond to userspace structures are not typedefed.

The kernel module implements a pseudo device that userspace
can read from and write to. Userspace can also manipulate the
kernel module through the pseudo device with ioctl.

The Bufmap
----------

At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers: one for IO and one for readdir
operations. The IO buffer is 41943040 bytes and the readdir buffer is
4194304 bytes. Each buffer contains logical chunks, or partitions, and
a pointer to each buffer is added to its own PVFS_dev_map_desc structure
which also describes its total size, as well as the size and number of
the partitions.

A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
mapping routine in the kernel module with an ioctl.
The structure is
copied from user space to kernel space with copy_from_user and is used
to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
then contains:

  * refcnt
    - a reference counter
  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
    partition size, which represents the filesystem's block size and
    is used for s_blocksize in super blocks.
  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
    partitions in the IO buffer.
  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
  * total_size - the total size of the IO buffer.
  * page_count - the number of 4096 byte pages in the IO buffer.
  * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
    of kcalloced memory. This memory is used as an array of pointers
    to each of the pages in the IO buffer through a call to get_user_pages.
  * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
    bytes of kcalloced memory. This memory is further initialized:

      user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
      structure. user_desc->ptr points to the IO buffer.

      ::

        pages_per_desc = bufmap->desc_size / PAGE_SIZE
        offset = 0

        bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[0].array_count = pages_per_desc = 1024
        bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
        offset += 1024
                           .
                           .
                           .
        bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[9].array_count = pages_per_desc = 1024
        bufmap->desc_array[9].uaddr = (user_desc->ptr) +
                                               (9 * 1024 * 4096)
        offset += 1024

  * buffer_index_array - a desc_count sized array of ints, used to
    indicate which of the IO buffer's partitions are available to use.
  * buffer_index_lock - a spinlock to protect buffer_index_array during update.
  * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
    int array used to indicate which of the readdir buffer's partitions are
    available to use.
  * readdir_index_lock - a spinlock to protect readdir_index_array during
    update.

Operations
----------

The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
needs to communicate with userspace.
Part of the op contains the "upcall"
which expresses the request to userspace. Part of the op eventually
contains the "downcall" which expresses the results of the request.

The slab allocator is used to keep a cache of op structures handy.

At init time the kernel module defines and initializes a request list
and an in_progress hash table to keep track of all the ops that are
in flight at any given time.

Ops are stateful:

 * unknown
    - op was just initialized
 * waiting
    - op is on request_list (upward bound)
 * inprogr
    - op is in progress (waiting for downcall)
 * serviced
    - op has matching downcall; ok
 * purged
    - op has to start a timer since client-core
      exited uncleanly before servicing op
 * given up
    - submitter has given up waiting for it

When some arbitrary userspace program needs to perform a
filesystem operation on Orangefs (readdir, I/O, create, whatever)
an op structure is initialized and tagged with a distinguishing ID
number. The upcall part of the op is filled out, and the op is
passed to the "service_operation" function.
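
The op states listed above can be modeled with a small userspace sketch.
All names here (``op_state_sketch``, ``struct op_sketch``,
``op_init_sketch``, ``op_next_state_sketch``) are hypothetical and are not
the kernel module's identifiers. The sketch only covers tagging a new op
and the uninterrupted path from "unknown" to "serviced"; how ops become
purged or given up is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names; the kernel module's own definitions differ. */
enum op_state_sketch {
	OP_UNKNOWN,	/* op was just initialized */
	OP_WAITING,	/* op is on the request list (upward bound) */
	OP_INPROGR,	/* op is in progress (waiting for downcall) */
	OP_SERVICED,	/* op has a matching downcall */
	OP_PURGED,	/* client-core exited before servicing the op */
	OP_GIVEN_UP	/* submitter has given up waiting */
};

struct op_sketch {
	uint64_t tag;			/* distinguishing ID number */
	enum op_state_sketch state;
};

/* Assumption: tags come from a simple counter. */
static uint64_t next_tag_sketch = 1;

/* Initialize an op: zero it, give it a fresh tag, state "unknown". */
static void op_init_sketch(struct op_sketch *op)
{
	assert(op != NULL);
	memset(op, 0, sizeof(*op));
	op->tag = next_tag_sketch++;
	op->state = OP_UNKNOWN;
}

/* Uninterrupted path only: unknown -> waiting -> inprogr -> serviced. */
static enum op_state_sketch op_next_state_sketch(enum op_state_sketch s)
{
	switch (s) {
	case OP_UNKNOWN:
		return OP_WAITING;
	case OP_WAITING:
		return OP_INPROGR;
	case OP_INPROGR:
		return OP_SERVICED;
	default:
		return s;	/* serviced, purged, given up: unchanged */
	}
}
```

A caller would run ``op_init_sketch()`` once per request and then advance
the state as the op moves onto the request list, is read by userspace,
and finally receives its downcall.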

Service_operation changes the op's state to "waiting", puts
it on the request list, and signals the Orangefs file_operations.poll
function through a wait queue. Userspace is polling the pseudo-device
and thus becomes aware of the upcall request that needs to be read.

When the Orangefs file_operations.read function is triggered, the
request list is searched for an op that seems ready-to-process.
The op is removed from the request list. The tag from the op and
the filled-out upcall struct are copy_to_user'ed back to userspace.

If any of these (and some additional protocol) copy_to_users fail,
the op's state is set to "waiting" and the op is added back to
the request list. Otherwise, the op's state is changed to "in progress",
and the op is hashed on its tag and put onto the end of a list in the
in_progress hash table at the index the tag hashed to.

When userspace has assembled the response to the upcall, it
writes the response, which includes the distinguishing tag, back to
the pseudo device in a series of io_vecs. This triggers the Orangefs
file_operations.write_iter function to find the op with the associated
tag and remove it from the in_progress hash table. As long as the op's
state is not "canceled" or "given up", its state is set to "serviced".
The file_operations.write_iter function returns to the waiting vfs,
and back to service_operation through wait_for_matching_downcall.

Service_operation returns to its caller with the op's downcall
part (the response to the upcall) filled out.

The "client-core" is the bridge between the kernel module and
userspace. The client-core is a daemon. The client-core has an
associated watchdog daemon. If the client-core is ever signaled
to die, the watchdog daemon restarts the client-core. Even though
the client-core is restarted "right away", there is a period of
time during such an event that the client-core is dead. A dead client-core
can't be triggered by the Orangefs file_operations.poll function.
Ops that pass through service_operation during a "dead spell" can time out
on the wait queue and one attempt is made to recycle them. Obviously,
if the client-core stays dead too long, the arbitrary userspace processes
trying to use Orangefs will be negatively affected. Waiting ops
that can't be serviced will be removed from the request list and
have their states set to "given up". In-progress ops that can't
be serviced will be removed from the in_progress hash table and
have their states set to "given up".

Readdir and I/O ops are atypical with respect to their payloads.

  - readdir ops use the smaller of the two pre-allocated pre-partitioned
    memory buffers. The readdir buffer is only available to userspace.
    The kernel module obtains an index to a free partition before launching
    a readdir op. Userspace deposits the results into the indexed partition
    and then writes them back to the pvfs device.

  - io (read and write) ops use the larger of the two pre-allocated
    pre-partitioned memory buffers. The IO buffer is accessible from
    both userspace and the kernel module. The kernel module obtains an
    index to a free partition before launching an io op. The kernel module
    deposits write data into the indexed partition, to be consumed
    directly by userspace. Userspace deposits the results of read
    requests into the indexed partition, to be consumed directly
    by the kernel module.

Responses to kernel requests are all packaged in pvfs2_downcall_t
structs. Besides a few other members, pvfs2_downcall_t contains a
union of structs, each of which is associated with a particular
response type.

The several members outside of the union are:

 ``int32_t type``
    - type of operation.
 ``int32_t status``
    - return code for the operation.
 ``int64_t trailer_size``
    - 0 unless readdir operation.
 ``char *trailer_buf``
    - initialized to NULL, used during readdir operations.

The appropriate member inside the union is filled out for any
particular response.

  PVFS2_VFS_OP_FILE_IO
    fill a pvfs2_io_response_t

  PVFS2_VFS_OP_LOOKUP
    fill a PVFS_object_kref

  PVFS2_VFS_OP_CREATE
    fill a PVFS_object_kref

  PVFS2_VFS_OP_SYMLINK
    fill a PVFS_object_kref

  PVFS2_VFS_OP_GETATTR
    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need)
    fill in a string with the link target when the object is a symlink.

  PVFS2_VFS_OP_MKDIR
    fill a PVFS_object_kref

  PVFS2_VFS_OP_STATFS
    fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
    us to know, in a timely fashion, these statistics about our
    distributed network filesystem.

  PVFS2_VFS_OP_FS_MOUNT
    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
    except its members are in a different order and "__pad1" is replaced
    with "id".

  PVFS2_VFS_OP_GETXATTR
    fill a pvfs2_getxattr_response_t

  PVFS2_VFS_OP_LISTXATTR
    fill a pvfs2_listxattr_response_t

  PVFS2_VFS_OP_PARAM
    fill a pvfs2_param_response_t

  PVFS2_VFS_OP_PERF_COUNT
    fill a pvfs2_perf_count_response_t

  PVFS2_VFS_OP_FSKEY
    fill a pvfs2_fs_key_response_t

  PVFS2_VFS_OP_READDIR
    jam everything needed to represent a pvfs2_readdir_response_t into
    the readdir buffer descriptor specified in the upcall.

Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
made by the kernel side.

A buffer_list containing:

  - a pointer to the prepared response to the request from the
    kernel (struct pvfs2_downcall_t).
  - and also, in the case of a readdir request, a pointer to a
    buffer containing descriptors for the objects in the target
    directory.

... is sent to the function (PINT_dev_write_list) which performs
the writev.
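
As a rough illustration of such a gathered write, the sketch below builds
an iovec array and hands it to writev() in a single call. The function
name ``write_response_sketch`` and its parameters are hypothetical, the
downcall is treated as an opaque byte buffer, and the three header
elements (protocol version, magic number, tag) are assumed to precede it
as in the io_array layout given in this section. It is not the
PINT_dev_write_list implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: gather header, downcall and optional trailer
 * into one writev() call. Returns the byte count reported by writev().
 */
static ssize_t write_response_sketch(int fd, int32_t proto_ver, int32_t magic,
				     uint64_t tag,
				     const void *downcall, size_t downcall_len,
				     const void *trailer, size_t trailer_len)
{
	struct iovec io_array[5];
	int count = 4;

	assert(downcall != NULL && downcall_len > 0);

	/* Header elements, present in every response. */
	io_array[0].iov_base = &proto_ver;
	io_array[0].iov_len = sizeof(proto_ver);
	io_array[1].iov_base = &magic;
	io_array[1].iov_len = sizeof(magic);
	io_array[2].iov_base = &tag;
	io_array[2].iov_len = sizeof(tag);

	/* The prepared response itself (opaque bytes in this sketch). */
	io_array[3].iov_base = (void *)downcall;
	io_array[3].iov_len = downcall_len;

	/* Optional fifth element, used only when a trailer is supplied. */
	if (trailer != NULL && trailer_len > 0) {
		io_array[4].iov_base = (void *)trailer;
		io_array[4].iov_len = trailer_len;
		count = 5;
	}

	return writev(fd, io_array, count);
}
```

For experiments, any writable descriptor, such as the write end of a
pipe(), can stand in for the request device.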

PINT_dev_write_list has a local iovec array: struct iovec io_array[10];

The first four elements of io_array are initialized like this for all
responses::

    io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
    io_array[0].iov_len = sizeof(int32_t)

    io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
    io_array[1].iov_len = sizeof(int32_t)

    io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
    io_array[2].iov_len = sizeof(int64_t)

    io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
                           of global variable vfs_request (vfs_request_t)
    io_array[3].iov_len = sizeof(pvfs2_downcall_t)

Readdir responses initialize the fifth element of io_array like this::

    io_array[4].iov_base = contents of member trailer_buf (char *)
                           from out_downcall member of global variable
                           vfs_request
    io_array[4].iov_len = contents of member trailer_size (PVFS_size)
                          from out_downcall member of global variable
                          vfs_request

Orangefs exploits the dcache in order to avoid sending redundant
requests to userspace. We keep object inode attributes up-to-date with
orangefs_inode_getattr.
Orangefs_inode_getattr uses two arguments to
help it decide whether or not to update an inode: "new" and "bypass".
Orangefs keeps private data in an object's inode that includes a short
timeout value, getattr_time, which allows any iteration of
orangefs_inode_getattr to know how long it has been since the inode was
updated. When the object is not new (new == 0) and the bypass flag is not
set (bypass == 0) orangefs_inode_getattr returns without updating the inode
if getattr_time has not timed out. Getattr_time is updated each time the
inode is updated.

Creation of a new object (file, dir, sym-link) includes the evaluation of
its pathname, resulting in a negative directory entry for the object.
A new inode is allocated and associated with the dentry, turning it from
a negative dentry into a "productive full member of society". Orangefs
obtains the new inode from Linux with new_inode() and associates
the inode with the dentry by sending the pair back to Linux with
d_instantiate().

The evaluation of a pathname for an object resolves to its corresponding
dentry. If there is no corresponding dentry, one is created for it in
the dcache. Whenever a dentry is modified or verified Orangefs stores a
short timeout value in the dentry's d_time, and the dentry will be trusted
for that amount of time.
Orangefs is a network filesystem, and objects
can potentially change out-of-band with any particular Orangefs kernel module
instance, so trusting a dentry is risky. The alternative to trusting
dentries is to always obtain the needed information from userspace - at
least a trip to the client-core, maybe to the servers. Obtaining information
from a dentry is cheap, obtaining it from userspace is relatively expensive,
hence the motivation to use the dentry when possible.

The timeout values d_time and getattr_time are jiffy based, and the
code is designed to avoid the jiffy-wrap problem::

    "In general, if the clock may have wrapped around more than once, there
    is no way to tell how much time has elapsed. However, if the times t1
    and t2 are known to be fairly close, we can reliably compute the
    difference in a way that takes into account the possibility that the
    clock may have wrapped between times."

from course notes by instructor Andy Wang
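
The wrap-safe comparison the quote alludes to can be sketched in plain C.
The helper below uses the same signed-difference idiom as the kernel's
time_after() macro. The names ``time_after_sketch`` and
``timeout_expired_sketch`` are hypothetical, and the code is a userspace
illustration under the assumption that the two times are less than half
the counter range apart; it is not the Orangefs implementation.

```c
#include <assert.h>

/*
 * True if a is later than b. The unsigned subtraction b - a is computed
 * modulo the counter range; read as a signed value it is negative when
 * a is ahead of b by no more than half the range, even if the counter
 * wrapped between the two samples.
 */
static int time_after_sketch(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* A cached value is stale once "now" has moved past its expiry time. */
static int timeout_expired_sketch(unsigned long now, unsigned long expiry)
{
	return time_after_sketch(now, expiry);
}
```

A plain ``now > expiry`` test would give the wrong answer for an expiry
time set just before the counter wraps, which is the case the signed
difference handles.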