When to choose an S3 big data environment over HDFS storage
Selecting a storage service for big data in the cloud can be challenging. Expert David Loshin explains usage patterns that could lead organizations to Amazon Simple Storage Service.
Enterprises are rapidly adopting hybrid architectures that blend on-premises and cloud-based services to enhance flexibility, extensibility and scalability. Cloud providers deliver services that are increasingly attractive to developers, with costs that appeal to the financial side of the organization.
One of the main challenges, however, can be understanding how cloud providers' functional and foundational capabilities fit together. For example, Amazon Simple Storage Service (S3) doesn't seem to be much more than a cloud-based extension of an organization's persistent storage environment. However, enterprises can also migrate on-premises data warehouse systems to the service to create an S3 big data cloud environment without having to immediately adopt Hadoop and its distributed file system, HDFS.
How Amazon S3 works
In some ways, S3 is simple: It is categorized as an object store that lets you store individual data instances, such as files or XML documents, as their own objects, or documents.
At the same time, because S3 is a cloud-based service, many aspects of its use are driven by underlying cost factors. The cost of S3 data management is broken down into four line items:
- storage costs -- billed by the volume of data stored per month;
- request costs -- billed by the number of requests to access objects stored in S3;
- storage management -- billed by the number of objects; and
- transfers -- billed by the volume of data transferred out of S3.
There is no cost to transfer data into S3.
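The four line items above can be combined into a rough monthly estimate. The following is a minimal sketch only: the rate values are hypothetical placeholders rather than actual AWS prices, and the names Rates and monthly_cost are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    # All rates are hypothetical placeholders, NOT actual AWS prices.
    storage_per_gb_month: float = 0.02
    per_1000_requests: float = 0.005
    management_per_1000_objects: float = 0.001
    transfer_out_per_gb: float = 0.09


def monthly_cost(stored_gb, requests, objects, gb_out, gb_in=0.0, rates=Rates()):
    """Return the four line items plus their total for one month.

    gb_in is accepted only to make the point explicit: inbound
    transfer does not contribute to the bill.
    """
    items = {
        "storage": stored_gb * rates.storage_per_gb_month,
        "requests": requests / 1000 * rates.per_1000_requests,
        "storage_management": objects / 1000 * rates.management_per_1000_objects,
        "transfer_out": gb_out * rates.transfer_out_per_gb,
    }
    items["total"] = sum(items.values())
    return items


if __name__ == "__main__":
    estimate = monthly_cost(stored_gb=1000, requests=2_000_000,
                            objects=500_000, gb_out=100, gb_in=5000)
    for name, amount in estimate.items():
        print(f"{name}: {amount:.2f}")
```

With these placeholder rates, the example workload comes to roughly 39.5 units per month, and changing gb_in has no effect on the total. Substituting the provider's current published rates would be necessary before using such an estimate for planning.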
In contrast to conventional hierarchical file systems, S3's organizational structure is flat: By default, an account can create up to 100 Amazon S3 buckets, each of which can hold an unlimited number of objects. One can mimic a hierarchical system with creative naming, as the slash character (/) is valid as part of a name.
For example, one could store a file with all the transaction records from the first quarter of 2016 in a single file named transactions/2016/Q1.json, all the transaction records from the second quarter of 2016 in a single file named transactions/2016/Q2.json, etc. These files can exist together in a specific bucket, and the use of the slash does not imply that there are folders or subfolders in the environment.
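To illustrate why the slash implies no real folders, here is a minimal sketch that models a bucket as a flat Python dictionary keyed by object name. It does not call S3; the bucket contents and the helper names list_keys and common_prefixes are hypothetical, and the 2017 key is added only to make the grouping visible.

```python
# A bucket modeled as a flat mapping from object name to content.
# There are no directories here, only names that contain "/".
bucket = {
    "transactions/2016/Q1.json": b"{}",
    "transactions/2016/Q2.json": b"{}",
    "transactions/2017/Q1.json": b"{}",
    "readme.txt": b"",
}


def list_keys(bucket, prefix=""):
    """Return all object names that start with the given prefix."""
    return sorted(key for key in bucket if key.startswith(prefix))


def common_prefixes(bucket, prefix="", delimiter="/"):
    """Group names under a prefix by the next delimiter.

    This produces a folder-like view purely from the names themselves.
    """
    groups = set()
    for key in bucket:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            groups.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
    return sorted(groups)


if __name__ == "__main__":
    print(list_keys(bucket, "transactions/2016/"))
    print(common_prefixes(bucket, "transactions/"))
```

Listing by the prefix transactions/2016/ returns the two quarterly files, and grouping under transactions/ yields the apparent "folders" transactions/2016/ and transactions/2017/, even though nothing but flat names is stored. S3's own object-listing requests accept prefix and delimiter parameters that support a similar view.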
Is an S3 big data environment for you?
To decide between an S3 big data cloud environment and HDFS storage, one must consider the usage patterns for data access, especially as the number of data instances subject to analysis increases. Consider these three usage examples:
- Direct access: An individual or an application wants to execute queries and retrieve result sets on demand. An example would be maintaining information updated on a periodic basis that is then used to refresh a website's set of content pages.
- Bulk access: An analytics or data warehousing application wants to integrate the complete contents of a data asset into its environment and access the full collection of objects. In this case, the data accessed is persisted within the consumer's application environment, which might be executing on a cloud cluster or on an on-premises server. This requires bulk extraction and movement of the data from S3.
- Intermittent access: The consuming application requires access to a subset of the data asset as part of an operational process, but it does not need to persist that data for an extended period of time. An example might be running a cybersecurity tracing algorithm over the network events within a well-defined time frame. The records for the events within that time frame are accessed and used by the algorithm, but once the algorithm is complete, there is no need to continue to persist the data.
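For the intermittent access pattern, a naming convention can let the application work out which objects cover a given time frame without scanning everything. The sketch below assumes a hypothetical convention of one object per day, named events/YYYY/MM/DD.json; neither that layout nor the function name keys_for_window comes from the article.

```python
from datetime import date, timedelta


def keys_for_window(start, end, prefix="events"):
    """Return the object names covering start..end, inclusive.

    Assumes a hypothetical layout of one object per calendar day,
    named <prefix>/YYYY/MM/DD.json.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    day_count = (end - start).days + 1
    days = (start + timedelta(days=offset) for offset in range(day_count))
    return [f"{prefix}/{day.strftime('%Y/%m/%d')}.json" for day in days]


if __name__ == "__main__":
    # Network events for a three-day investigation window.
    for key in keys_for_window(date(2016, 3, 30), date(2016, 4, 1)):
        print(key)
```

The tracing algorithm would then request only those objects, process them and discard them, so nothing needs to be persisted outside S3 once the run completes.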
S3 is perfectly adaptable to each of these usage scenarios. Yet, choosing an S3 big data environment is just the first step in the process.
Moving data to S3 may be straightforward, but managing that data requires some additional thought. Each of these usage patterns has different implications for the design of the storage architecture -- for example, whether to store large collections of objects within a single document or to use an individual document for each object -- and for document naming conventions. Each also calls for monitoring how usage affects both application performance and cost.
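The layout tradeoff can be made concrete with a little arithmetic. This is a rough sketch under stated assumptions: every record has the same size, each object is retrieved whole, and the records needed sit next to each other in storage. The function name access_profile and the figures in the example are hypothetical.

```python
import math


def access_profile(records_needed, records_per_object, record_bytes):
    """Estimate requests and bytes moved to read a run of records.

    Assumes fixed-size records, whole-object retrieval, and that the
    needed records are contiguous and aligned to object boundaries.
    Returns (objects_fetched, bytes_transferred).
    """
    objects_fetched = math.ceil(records_needed / records_per_object)
    bytes_transferred = objects_fetched * records_per_object * record_bytes
    return objects_fetched, bytes_transferred


if __name__ == "__main__":
    needed = 900          # records the application actually wants
    record_size = 1000    # bytes per record (hypothetical)

    # Layout A: one large document holding a quarter's 90,000 records.
    print("single document:", access_profile(needed, 90_000, record_size))

    # Layout B: one document per record.
    print("document per record:", access_profile(needed, 1, record_size))
```

Under these assumptions, the single large document needs one request but moves 90 MB to deliver 900 records, while the per-record layout moves only 900 KB but issues 900 requests and adds 90,000 objects per quarter to the object count. Which side of that tradeoff is cheaper depends on the access pattern and on the request, management and transfer rates in effect.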