.. SPDX-License-Identifier: GPL-2.0

============================
Ceph Distributed File System
============================

Ceph is a distributed network file system designed to provide good
performance, reliability, and scalability.

Basic features include:

 * POSIX semantics
 * Seamless scaling from 1 to many thousands of nodes
 * High availability and reliability.  No single point of failure.
 * N-way replication of data across storage nodes
 * Fast recovery from node failures
 * Automatic rebalancing of data on node addition/removal
 * Easy deployment: most FS components are userspace daemons

Also,

 * Flexible snapshots (on any directory)
 * Recursive accounting (nested files, directories, bytes)

In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
on symmetric access by all clients to shared block devices, Ceph
separates data and metadata management into independent server
clusters, similar to Lustre.  Unlike Lustre, however, metadata and
storage nodes run entirely as user space daemons.  File data is striped
across storage nodes in large chunks to distribute workload and
facilitate high throughput.  When storage nodes fail, data is
re-replicated in a distributed fashion by the storage nodes themselves
(with some minimal coordination from a cluster monitor), making the
system extremely efficient and scalable.

Metadata servers effectively form a large, consistent, distributed
in-memory cache above the file namespace that is extremely scalable,
dynamically redistributes metadata in response to workload changes,
and can tolerate arbitrary (well, non-Byzantine) node failures.  The
metadata server takes a somewhat unconventional approach to metadata
storage to significantly improve performance for common workloads.  In
particular, inodes with only a single link are embedded in
directories, allowing entire directories of dentries and inodes to be
loaded into its cache with a single I/O operation.  The contents of
extremely large directories can be fragmented and managed by
independent metadata servers, allowing scalable concurrent access.

The system offers automatic data rebalancing/migration when scaling
from a small cluster of just a few nodes to many hundreds, without
requiring an administrator to carve the data set into static volumes or
go through the tedious process of migrating data between servers.
When the file system approaches full, new nodes can be easily added
and things will "just work."

Ceph includes a flexible snapshot mechanism that allows a user to create
a snapshot on any subdirectory (and its nested contents) in the
system.  Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.

Snapshot names have two limitations:

* They cannot start with an underscore ('_'), as these names are reserved
  for internal usage by the MDS.
* They cannot exceed 240 characters in size.  This is because the MDS makes
  use of long snapshot names internally, which follow the format:
  ``_<SNAPSHOT-NAME>_<INODE-NUMBER>``.  Since filenames in general can't have
  more than 255 characters, and ``<INODE-NUMBER>`` takes 13 characters, the
  snapshot name itself can take at most 255 - 1 - 1 - 13 = 240 characters.

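As a rough illustration of the two rules above, a small shell function can
check a candidate name before it is passed to ``mkdir``.  The function name
``valid_snap_name`` is hypothetical and not part of Ceph.  The sketch assumes
a POSIX shell, and that the length reported by ``${#name}`` matches what the
MDS counts, which may not hold for multi-byte names:

```shell
# Hypothetical helper (not part of Ceph): check a snapshot name against the
# two limits described above before running "mkdir .snap/<name>".
valid_snap_name() {
    name=$1
    case $name in
        _*) return 1 ;;   # a leading underscore is reserved for the MDS
    esac
    # 255 (filename limit) - 2 underscores - 13 (inode number) = 240
    [ "${#name}" -le 240 ]
}
```

For example, ``valid_snap_name foo && mkdir .snap/foo`` would create the
snapshot only if the name passes both checks.
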
Ceph also provides some recursive accounting on directories for nested files
and bytes.  You can run the commands::

 getfattr -n ceph.dir.rfiles /some/dir
 getfattr -n ceph.dir.rbytes /some/dir

to get the total number of nested files and their combined size, respectively.
This makes the identification of large disk space consumers relatively quick,
as no 'du' or similar recursive scan of the file system is required.

Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
beneath that point in the directory hierarchy.  Quotas can be set using
extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes', e.g.::

 setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
 getfattr -n ceph.quota.max_bytes /some/dir

A limitation of the current quotas implementation is that it relies on the
cooperation of the client mounting the file system to stop writers when a
limit is reached.  A modified or adversarial client cannot be prevented
from writing as much data as it needs.

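Because both the recursive byte count and the byte quota are plain numbers
read from extended attributes, they can be compared with ordinary shell
arithmetic.  The following is only a sketch: the function name
``quota_used_percent`` is hypothetical, and treating a limit of 0 as "no
quota" is an assumption made for this example:

```shell
# Hypothetical helper: report how much of a byte quota is in use.
#   $1 = value read from ceph.dir.rbytes
#   $2 = value read from ceph.quota.max_bytes
quota_used_percent() {
    used=$1
    max=$2
    if [ "$max" -eq 0 ]; then
        echo "no quota"       # assumption: 0 means no limit is set
        return 0
    fi
    echo $(( used * 100 / max ))   # integer percentage, rounded down
}
```

On a mounted file system the two arguments could be obtained with
``getfattr --only-values -n ceph.dir.rbytes /some/dir`` and
``getfattr --only-values -n ceph.quota.max_bytes /some/dir``.
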
Mount Syntax
============

The basic mount syntax is::

 # mount -t ceph user@fsid.fs_name=/[subdir] mnt -o mon_addr=monip1[:port][/monip2[:port]]

You only need to specify a single monitor, as the client will get the
full list when it connects.  (However, if the monitor you specify
happens to be down, the mount won't succeed.)  The port can be left
off if the monitor is using the default.  So if the monitor is at
1.2.3.4::

 # mount -t ceph cephuser@07fe3187-00d9-42a3-814b-72a4d5e7d5be.cephfs=/ /mnt/ceph -o mon_addr=1.2.3.4

is sufficient.  If /sbin/mount.ceph is installed, a hostname can be
used instead of an IP address and the cluster FSID can be left out
(as the mount helper will fill it in by reading the ceph configuration
file)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=mon-addr

Multiple monitor addresses can be passed by separating each address with a
slash (``/``)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=192.168.1.100/192.168.1.101

When using the mount helper, the monitor address can be read from the ceph
configuration file if available.  Note that the cluster FSID (passed as part
of the device string) is validated by checking it against the FSID reported
by the monitor.

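The pieces of the device string and the slash-separated monitor list can be
assembled mechanically.  The sketch below only prints a mount command in the
syntax described above and does not run it.  The function name
``ceph_mount_cmd`` and its argument order are hypothetical choices for this
example:

```shell
# Hypothetical helper: print (not run) a mount command in the syntax above.
# usage: ceph_mount_cmd USER FSID FS_NAME SUBDIR MOUNTPOINT MON...
# Pass an empty FSID to rely on the mount helper filling it in.
ceph_mount_cmd() {
    user=$1 fsid=$2 fs=$3 subdir=$4 mnt=$5
    shift 5
    mons=
    for m in "$@"; do
        mons=${mons:+$mons/}$m        # monitor addresses are joined with '/'
    done
    dev=$user@${fsid:+$fsid.}$fs=$subdir
    echo "mount -t ceph $dev $mnt -o mon_addr=$mons"
}
```

Calling ``ceph_mount_cmd cephuser "" cephfs / /mnt/ceph 192.168.1.100 192.168.1.101``
prints the same command as the multiple-monitor example above.
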
Mount Options
=============

  mon_addr=ip_address[:port][/ip_address[:port]]
        Monitor address of the cluster.  This is used to bootstrap the
        connection to the cluster.  Once the connection is established, the
        monitor addresses in the monitor map are followed.

  fsid=cluster-id
        FSID of the cluster (from the ``ceph fsid`` command).

  ip=A.B.C.D[:N]
        Specify the IP and/or port the client should bind to locally.
        There is normally not much reason to do this.  If the IP is not
        specified, the client's IP address is determined by looking at the
        address its connection to the monitor originates from.

  wsize=X
        Specify the maximum write size in bytes.  Default: 64 MB.

  rsize=X
        Specify the maximum read size in bytes.  Default: 64 MB.

  rasize=X
        Specify the maximum readahead size in bytes.  Default: 8 MB.

  mount_timeout=X
        Specify the timeout value for mount (in seconds), in the case
        of a non-responsive Ceph file system.  The default is 60
        seconds.

  caps_max=X
        Specify the maximum number of caps to hold.  Unused caps are released
        when the number of caps exceeds the limit.  The default is 0 (no
        limit).

  rbytes
        When stat() is called on a directory, set st_size to 'rbytes',
        the summation of file sizes over all files nested beneath that
        directory.  This is the default.

  norbytes
        When stat() is called on a directory, set st_size to the
        number of entries in that directory.

  nocrc
        Disable CRC32C calculation for data writes.  If set, the storage node
        must rely on TCP's error correction to detect data corruption
        in the data payload.

  dcache
        Use the dcache contents to perform negative lookups and
        readdir when the client has the entire directory contents in
        its cache.  (This does not change correctness; the client uses
        cached metadata only when a lease or capability ensures it is
        valid.)

  nodcache
        Do not use the dcache as above.  This avoids a significant amount of
        complex code, sacrificing performance without affecting correctness,
        and is useful for tracking down bugs.

  noasyncreaddir
        Do not use the dcache as above for readdir.

  noquotadf
        Report overall filesystem usage in statfs instead of using the root
        directory quota.

  nocopyfrom
        Don't use the RADOS 'copy-from' operation to perform remote object
        copies.  Currently, it's only used in copy_file_range, which will revert
        to the default VFS implementation if this option is used.

  recover_session=<no|clean>
        Set auto reconnect mode in the case where the client is blocklisted.  The
        available modes are "no" and "clean".  The default is "no".

        * no: never attempt to reconnect when the client detects that it has
          been blocklisted.  Operations will generally fail after being
          blocklisted.

        * clean: the client reconnects to the ceph cluster automatically when
          it detects that it has been blocklisted.  During reconnect, the client
          drops dirty data/metadata and invalidates page caches and writable
          file handles.  After reconnect, file locks become stale because the
          MDS loses track of them.  If an inode contains any stale file locks,
          read/write on the inode is not allowed until applications release all
          stale file locks.

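Since wsize, rsize and rasize are given in bytes, it can be convenient to
compute them rather than type them out.  The sketch below builds an option
string for ``-o``.  The helper name ``mib`` is hypothetical, the chosen values
are illustrative rather than recommendations, and it assumes that "MB" in the
defaults above means 1024 * 1024 bytes:

```shell
# Hypothetical helper: convert a size in MiB to the byte count that
# wsize, rsize and rasize expect (assumes 1 MB = 1024 * 1024 bytes).
mib() { echo $(( $1 * 1024 * 1024 )); }

# Example option string; the values are illustrative only.
opts="mon_addr=1.2.3.4,rasize=$(mib 16),recover_session=clean"
echo "$opts"
```

The resulting string can then be passed to mount as ``-o "$opts"``.
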
More Information
================

For more information on Ceph, see the home page at
        https://ceph.com/

The Linux kernel client source tree is available at
        - https://github.com/ceph/ceph-client.git

and the source for the full system is at
        https://github.com/ceph/ceph.git