What is a distributed file system (DFS)?
A distributed file system (DFS) is a file system that enables clients to access file storage from multiple hosts through a computer network as if they were accessing local storage. Files are spread across multiple storage servers, often in multiple locations, which enables users to share data and storage resources. A DFS can be designed so geographically distributed users, such as remote workers and distributed teams, can access and share files remotely as if they were stored locally.
How a DFS works
A DFS clusters together multiple storage nodes and logically distributes data sets across multiple nodes that each have their own computing power and storage. The data on a DFS can reside on various types of storage devices, such as solid-state drives and hard disk drives.
Data sets are replicated onto multiple servers, which provides the redundancy needed to keep data highly available. The DFS runs on a collection of servers, mainframes or a cloud environment, accessed over a local area network (LAN), so multiple users can access and store unstructured data. If organizations need to scale up their infrastructure, they can add more storage nodes to the DFS.
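The replication and failover behavior described above can be sketched in a few lines. This is a minimal toy model, not how any particular DFS implements replication: local directories stand in for storage nodes, and the read path returns the first available copy, so losing one node does not lose the data.

```python
import shutil
from pathlib import Path

def replicate(source: Path, nodes: list[Path]) -> None:
    """Copy a file onto every storage node so no single node is a point of failure."""
    for node in nodes:
        node.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, node / source.name)

def read_first_available(name: str, nodes: list[Path]) -> bytes:
    """Return the first reachable copy of the file, skipping failed nodes."""
    for node in nodes:
        copy = node / name
        if copy.exists():
            return copy.read_bytes()
    raise FileNotFoundError(name)
```

If one node directory is deleted (simulating a drive crash), `read_first_available` simply falls through to the next replica, which is the essence of how replication keeps data highly available.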
Clients access data on a DFS using namespaces. Organizations can group shared folders into logical namespaces: a namespace is a shared group of networked storage under a DFS root, presented to users as one shared folder with multiple subfolders. When a user requests a file, the DFS brings up the first available copy of the file.
There are two types of namespaces:
- Standalone DFS namespaces. A standalone or independent DFS namespace has just one host server. Standalone namespaces do not use Active Directory (AD). In a standalone namespace, the configuration data for the DFS is stored on the host server's registry. A standalone namespace is often used in environments that only need one server.
- Domain-based DFS namespaces. Domain-based namespaces have multiple host servers and store the DFS configuration and topology data in AD. They are commonly used in environments that require higher availability.
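Either way, a namespace is essentially a table that maps logical folder names to one or more server targets; what differs is where that table is stored (the host server's registry for a standalone namespace, AD for a domain-based one). A toy sketch of the lookup, with hypothetical server names:

```python
# Hypothetical namespace table: each logical folder under the DFS root
# maps to one or more server targets (replicated folders have several).
NAMESPACE = {
    "docs": ["//server1/docs", "//server2/docs"],  # replicated folder
    "media": ["//server3/media"],                  # single target
}

def resolve(folder: str, is_up) -> str:
    """Hand the client the first target whose server is currently available."""
    for target in NAMESPACE.get(folder, []):
        if is_up(target):
            return target
    raise LookupError(f"no available target for {folder}")
```

Because the client only ever sees the logical folder name, the DFS can move or replicate the underlying folders without users noticing, which is the transparency discussed below.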
Advantages and disadvantages of a DFS
A DFS provides organizations with a scalable system to manage unstructured data remotely. It can enable organizations to use legacy storage to save costs of storage devices and hardware. A DFS also improves availability of data through replication.
However, security measures need to be in place to protect storage nodes. In addition, there is a risk of data loss when data is replicated across storage nodes. It can also be complicated to reconfigure a DFS should an organization replace storage hardware on any of the DFS nodes.
Features of a DFS
Organizations use a DFS for features such as scalability, security and remote access to data. Features of a DFS include the following:
- Location independence. Users do not need to be aware of where data is stored. The DFS manages the location and presents files as if they are stored locally.
- Transparency. Transparency hides the distributed nature of the system from users and applications, which see a single, ordinary file system. There are multiple types of transparency in distributed file systems, including the following:
  - Structural transparency. Data appears as if it's on a user's device. Users are unable to see how the DFS is configured, such as the number of file servers or storage devices.
  - Access transparency. Users can access files whether they are located locally or remotely. Files can be accessed no matter where the user is, as long as they are logged in to the system. If data is not stored on the same server, users should not be able to tell, and applications written for local files should also work on remote files.
  - Replication transparency. The fact that a file is replicated across different nodes of the file system, such as on another storage system, is hidden from users. This enables the system to maintain multiple copies without affecting how users work with the file.
  - Naming transparency. A file's name and path should not change when the file moves among storage nodes.
- Scalability. To scale a DFS, organizations can add file servers or storage nodes.
- High availability. The DFS should continue to work in the event of a partial failure in the system, such as a node failure or drive crash. A DFS should also create backup copies if there are any failures in the system.
- Security. Data should be encrypted at rest and in transit to prevent unauthorized access or data deletion.
Implementations of a DFS
A DFS uses file sharing protocols that enable users to access file servers over the DFS as if they were local storage.
Protocols a DFS can use include the following:
- Server Message Block (SMB). SMB is a file sharing protocol designed to allow read and write operations on files over a LAN. It is used primarily in Windows environments.
- Network File System (NFS). NFS is a client-server protocol for distributed file sharing commonly used with network-attached storage systems and with Linux and Unix operating systems.
- Hadoop Distributed File System (HDFS). HDFS is a DFS designed to store the large data sets used by Hadoop applications across clusters of commodity hardware.
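Whichever protocol is in use, the mount point hides it from applications: once a share is mounted (for example, at a path such as /mnt/dfs), programs read and write it with the same file APIs they use for a local disk. A minimal sketch of that access transparency, using a temporary local directory as a stand-in for a real SMB or NFS mount point:

```python
from pathlib import Path

def roundtrip_note(mount_point: Path) -> str:
    """Write and read a file with ordinary file APIs; the code is identical
    whether mount_point is a local directory or an SMB/NFS mount."""
    note = mount_point / "notes.txt"
    note.write_text("same code, local or remote")
    return note.read_text()
```

The application never references a protocol or a server name, only a path, which is what lets a DFS swap protocols or relocate data without changing client code.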
Open source distributed file systems include the following:
- Ceph. Ceph is open source software designed to enable organizations to distribute data across multiple storage nodes. Ceph is used in many OpenStack implementations.
- GlusterFS. GlusterFS is a DFS that aggregates multiple disk storage resources into a single namespace.
Vendors that offer DFS products
Various storage vendors offer DFS products and capabilities for unstructured data applications and workloads.
Vendors with DFS products include the following:
- Pure Storage