.. SPDX-License-Identifier: GPL-2.0

============================
Ceph Distributed File System
============================

Ceph is a distributed network file system designed to provide good
performance, reliability, and scalability.

Basic features include:

 * POSIX semantics
 * Seamless scaling from 1 to many thousands of nodes
 * High availability and reliability.  No single point of failure.
 * N-way replication of data across storage nodes
 * Fast recovery from node failures
 * Automatic rebalancing of data on node addition/removal
 * Easy deployment: most FS components are userspace daemons

Also,

 * Flexible snapshots (on any directory)
 * Recursive accounting (nested files, directories, bytes)

In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
on symmetric access by all clients to shared block devices, Ceph
separates data and metadata management into independent server
clusters, similar to Lustre.  Unlike Lustre, however, metadata and
storage nodes run entirely as user space daemons.  File data is striped
across storage nodes in large chunks to distribute workload and
facilitate high throughputs.
When storage nodes fail, data is
re-replicated in a distributed fashion by the storage nodes themselves
(with some minimal coordination from a cluster monitor), making the
system extremely efficient and scalable.

Metadata servers effectively form a large, consistent, distributed
in-memory cache above the file namespace that is extremely scalable,
dynamically redistributes metadata in response to workload changes,
and can tolerate arbitrary (well, non-Byzantine) node failures.  The
metadata server takes a somewhat unconventional approach to metadata
storage to significantly improve performance for common workloads.  In
particular, inodes with only a single link are embedded in
directories, allowing entire directories of dentries and inodes to be
loaded into its cache with a single I/O operation.  The contents of
extremely large directories can be fragmented and managed by
independent metadata servers, allowing scalable concurrent access.

The system offers automatic data rebalancing/migration when scaling
from a small cluster of just a few nodes to many hundreds, without
requiring an administrator to carve the data set into static volumes or
go through the tedious process of migrating data between servers.
When the file system approaches full, new nodes can be easily added
and things will "just work."

Ceph includes a flexible snapshot mechanism that allows a user to create
a snapshot on any subdirectory (and its nested contents) in the
system.
Snapshot creation and deletion are as simple as
'mkdir .snap/foo' and 'rmdir .snap/foo'.

Snapshot names have two limitations:

* They cannot start with an underscore ('_'), as these names are reserved
  for internal usage by the MDS.
* They cannot exceed 240 characters in size.  This is because the MDS makes
  use of long snapshot names internally, which follow the format:
  `_<SNAPSHOT-NAME>_<INODE-NUMBER>`.  Since filenames in general can't have
  more than 255 characters, and `<INODE-NUMBER>` takes 13 characters, the
  snapshot name itself can take at most 255 - 1 - 1 - 13 = 240 characters.

Ceph also provides some recursive accounting on directories for nested files
and bytes.  You can run the commands::

  getfattr -n ceph.dir.rfiles /some/dir
  getfattr -n ceph.dir.rbytes /some/dir

to get the total number of nested files and their combined size, respectively.
This makes the identification of large disk space consumers relatively quick,
as no 'du' or similar recursive scan of the file system is required.

Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
beneath that point in the directory hierarchy.
Quotas can be set using the
extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', e.g.::

  setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
  getfattr -n ceph.quota.max_bytes /some/dir

A limitation of the current quota implementation is that it relies on the
cooperation of the client mounting the file system to stop writers when a
limit is reached.  A modified or adversarial client cannot be prevented
from writing as much data as it needs.

Mount Syntax
============

The basic mount syntax is::

  # mount -t ceph user@fsid.fs_name=/[subdir] mnt -o mon_addr=monip1[:port][/monip2[:port]]

You only need to specify a single monitor, as the client will get the
full list when it connects.  (However, if the monitor you specify
happens to be down, the mount won't succeed.)  The port can be left
off if the monitor is using the default.  So if the monitor is at
1.2.3.4::

  # mount -t ceph cephuser@07fe3187-00d9-42a3-814b-72a4d5e7d5be.cephfs=/ /mnt/ceph -o mon_addr=1.2.3.4

is sufficient.
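
As an illustration of how the pieces of the basic syntax fit together, the
following is a minimal sketch of a shell function that assembles the device
string and the mon_addr option. The function name build_ceph_mount_cmd and its
argument order are hypothetical and are not part of Ceph or mount.ceph; the
function only prints the command and does not run it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph): print a mount command that follows
#   mount -t ceph user@fsid.fs_name=/[subdir] mnt -o mon_addr=mon1[/mon2...]
# Arguments: user fsid fs_name subdir mountpoint monitor [monitor ...]
# Pass an empty string as subdir to mount the file system root.
build_ceph_mount_cmd() {
    user=$1
    fsid=$2
    fs_name=$3
    subdir=$4
    mountpoint=$5
    shift 5

    # The remaining arguments are monitor addresses; join them with '/'.
    mon_addr=
    for mon in "$@"; do
        mon_addr="${mon_addr:+$mon_addr/}$mon"
    done

    device="${user}@${fsid}.${fs_name}=/${subdir}"
    printf 'mount -t ceph %s %s -o mon_addr=%s\n' \
        "$device" "$mountpoint" "$mon_addr"
}
```

Calling it with the values from the example above, for instance
`build_ceph_mount_cmd cephuser 07fe3187-00d9-42a3-814b-72a4d5e7d5be cephfs "" /mnt/ceph 1.2.3.4`,
prints the same mount command line (without the leading prompt character).
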
If /sbin/mount.ceph is installed, a hostname can be
used instead of an IP address and the cluster FSID can be left out
(as the mount helper will fill it in by reading the Ceph configuration
file)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=mon-addr

Multiple monitor addresses can be passed by separating each address with
a slash (`/`)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=192.168.1.100/192.168.1.101

When using the mount helper, the monitor address can be read from the Ceph
configuration file if available.  Note that the cluster FSID (passed as part
of the device string) is validated by checking it against the FSID reported by
the monitor.

Mount Options
=============

  mon_addr=ip_address[:port][/ip_address[:port]]
        Monitor address to the cluster.  This is used to bootstrap the
        connection to the cluster.  Once the connection is established, the
        monitor addresses in the monitor map are followed.

  fsid=cluster-id
        FSID of the cluster (from `ceph fsid` command).

  ip=A.B.C.D[:N]
        Specify the IP and/or port the client should bind to locally.
        There is normally not much reason to do this.
        If the IP is not
        specified, the client's IP address is determined by looking at the
        address its connection to the monitor originates from.

  wsize=X
        Specify the maximum write size in bytes.  Default: 64 MB.

  rsize=X
        Specify the maximum read size in bytes.  Default: 64 MB.

  rasize=X
        Specify the maximum readahead size in bytes.  Default: 8 MB.

  mount_timeout=X
        Specify the timeout value for mount (in seconds), in the case
        of a non-responsive Ceph file system.  The default is 60
        seconds.

  caps_max=X
        Specify the maximum number of caps to hold.  Unused caps are released
        when the number of caps exceeds the limit.  The default is 0 (no limit).

  rbytes
        When stat() is called on a directory, set st_size to 'rbytes',
        the summation of file sizes over all files nested beneath that
        directory.  This is the default.

  norbytes
        When stat() is called on a directory, set st_size to the
        number of entries in that directory.

  nocrc
        Disable CRC32C calculation for data writes.  If set, the storage node
        must rely on TCP's error correction to detect data corruption
        in the data payload.

  dcache
        Use the dcache contents to perform negative lookups and
        readdir when the client has the entire directory contents in
        its cache.  (This does not change correctness; the client uses
        cached metadata only when a lease or capability ensures it is
        valid.)

  nodcache
        Do not use the dcache as above.  This avoids a significant amount of
        complex code, sacrificing performance without affecting correctness,
        and is useful for tracking down bugs.

  noasyncreaddir
        Do not use the dcache as above for readdir.

  noquotadf
        Report overall filesystem usage in statfs instead of using the root
        directory quota.

  nocopyfrom
        Don't use the RADOS 'copy-from' operation to perform remote object
        copies.  Currently, it's only used in copy_file_range, which will revert
        to the default VFS implementation if this option is used.

  recover_session=<no|clean>
        Set auto reconnect mode in the case where the client is blocklisted. The
        available modes are "no" and "clean". The default is "no".

        * no: never attempt to reconnect when client detects that it has been
          blocklisted. Operations will generally fail after being blocklisted.

        * clean: client reconnects to the ceph cluster automatically when it
          detects that it has been blocklisted. During reconnect, client drops
          dirty data/metadata, invalidates page caches and writable file handles.
          After reconnect, file locks become stale because the MDS loses track
          of them. If an inode contains any stale file locks, read/write on the
          inode is not allowed until applications release all stale file locks.

More Information
================

For more information on Ceph, see the home page at
        https://ceph.com/

The Linux kernel client source tree is available at
        - https://github.com/ceph/ceph-client.git

and the source for the full system is at
        https://github.com/ceph/ceph.git