Alluxio has launched Alluxio 2.0, a platform designed for data engineers who manage and deploy analytical and AI workloads in the cloud.
According to Alluxio, the 2.0 version was built particularly with hybrid and multi-cloud environments in mind, with the aim of providing data orchestration to bring data locality, accessibility and elasticity to compute.
Alluxio 2.0 Community Edition and Enterprise Edition provide a handful of new capabilities, including data orchestration for multi-cloud, compute-optimized data access for cloud analytics, AWS support and architectural foundations using open source.
Data orchestration for multi-cloud
There are three main components to the data orchestration capabilities of Alluxio 2.0: policy-driven data management, administration of data access policies and cross-cloud storage data movement using data service.
Policy-driven data management enables data engineers to automate data movement across different storage systems based on predefined policies. Users can also automate tiering of data across any environment or any number of storage systems. Alluxio claims this will reduce storage costs because the data platform teams will only manage the most important data in the expensive storage systems, while moving less important data to cheaper alternatives.
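To make the idea of policy-driven tiering concrete, here is a minimal sketch of how a predefined policy could decide which data moves to cheaper storage. It is an illustration only, not Alluxio's implementation or API: the `FileInfo` type, the `plan_moves` function, the "hot" and "cold" tier names and the 30-day threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    """Hypothetical record describing one managed file."""
    path: str
    days_since_access: int
    tier: str  # "hot" (expensive storage) or "cold" (cheaper storage)


def plan_moves(files, cold_after_days=30):
    """Return (path, target_tier) pairs for files whose tier should change.

    Assumed policy: data not accessed for `cold_after_days` days or more
    belongs in the cold tier; everything else belongs in the hot tier.
    """
    moves = []
    for f in files:
        target = "cold" if f.days_since_access >= cold_after_days else "hot"
        if target != f.tier:
            moves.append((f.path, target))
    return moves


files = [
    FileInfo("/sales/2019/q2.parquet", 2, "hot"),
    FileInfo("/sales/2016/q1.parquet", 400, "hot"),
    FileInfo("/logs/archive.gz", 90, "cold"),
]

# Only the stale file that still sits in the hot tier is scheduled to move.
print(plan_moves(files))
```

In this sketch the policy only produces a plan; a real system would hand that plan to whatever service actually copies the data between storage systems.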
The administration of data access policies enables users to configure policies at any directory or folder level to streamline data access and workload performance. This includes defining behaviors for individual data sets for core functions, such as writing data or syncing it with Alluxio storage systems.
With cross-cloud storage data movement using data service, Alluxio claims users get highly efficient data movement across cloud stores, such as AWS S3 and Google Cloud services.
Compute-optimized data access for cloud analytics
The compute-optimized data access capabilities include two components: compute-focused cluster partitioning and integration with external data sources over REST.
Compute-focused cluster partitioning enables users to partition a single Alluxio cluster based on any dimension. This keeps the data sets of each framework or workload from being contaminated by those of another. Alluxio claims that this reduces data transfer costs and constrains data to stay within a specific region or zone.
Integration with external data sources over REST enables users to import data from web-based sources, which can then be aggregated in Alluxio to perform analytics. Users can also direct web locations with files to Alluxio to be pulled in as needed.
AWS support
The new release integrates with the Amazon Elastic MapReduce (EMR) service. According to Alluxio, Amazon EMR is frequently used to deploy analytical and AI workloads during a move to cloud services. Alluxio is now available as a data layer within EMR for the Spark, Presto and Hive frameworks.
Architectural foundations using open source
According to Alluxio, core foundational elements have been rebuilt using open source technologies. RocksDB is now used to tier the metadata of the files and objects that Alluxio manages, which the company says enables hyperscale. Alluxio uses gRPC as the core transport protocol for communication within clusters, as well as between the client and master.
In addition to the main components, other new features include the following:
- Alluxio Data Service: A distributed clustered service.
- Adaptive replication: Lets users configure a range for the number of copies of data stored in Alluxio, which Alluxio then manages automatically.
- Embedded journal: A fault tolerance and high availability mode for file and object metadata that uses the Raft consensus algorithm and does not depend on external storage systems.
- Alluxio POSIX API: A Portable OS Interface-compatible API that enables frameworks such as TensorFlow, Caffe and other Python-based models to directly access data from any storage system through Alluxio using traditional file system access.
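As a rough illustration of what POSIX-style access means for a Python-based framework, the sketch below reads data with ordinary file calls and no storage-specific client. It assumes Alluxio is exposed as a mounted directory. The `read_head` helper and the file name are hypothetical, and a temporary directory stands in for the mount point so the example runs anywhere.

```python
import os
import tempfile


def read_head(mount_point, relative_path, n_bytes=16):
    """Read the first n_bytes of a file using ordinary file system calls.

    `mount_point` is assumed to be a directory where the storage layer
    is mounted; nothing here is specific to Alluxio.
    """
    full_path = os.path.join(mount_point, relative_path)
    with open(full_path, "rb") as f:
        return f.read(n_bytes)


# Stand-in for a real mount: a temporary directory with one sample file.
with tempfile.TemporaryDirectory() as mount_point:
    with open(os.path.join(mount_point, "train.csv"), "wb") as f:
        f.write(b"label,feature\n1,0.5\n")
    print(read_head(mount_point, "train.csv"))
```

Because the code relies only on `open` and `read`, the same function would work unchanged against any POSIX-compatible mount, which is the appeal of this kind of interface for existing training pipelines.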
Alluxio 2.0 Community Edition and Enterprise Edition are both generally available now.