Organizations that are expanding their traditional infrastructure in the face of big data challenges should consider assembling a big data stack architecture using purpose-built software products and services.
A big data stack is a suite of complementary software technologies used to manage and analyze data sets too large or complex for traditional technologies. Big data stack technologies -- most often applied in analytics -- are specifically designed to address increases in the size, speed and structure of data. Big data products and services are often used to manage data within a data pipeline to deliver timely and efficient business insights.
Organizations can consider several popular big data stacks, each with a set of technologies and open source alternatives. Whether they choose a packaged stack or build their own, big data stacks have become staples of modern data architectures.
Big data challenges
Big data is often described in terms of scale and complexity, which present unique challenges referred to as the three Vs:
- Volume. The amount of data, both human and machine generated. Typically, machine-generated data, such as sensor data, generates significantly larger quantities than human-generated transactional data. The massive amount of data at rest and in motion presents a challenge for organizations.
- Velocity. The rate of data. Machine-generated data is often produced at higher frequencies than its human counterpart. The challenge is the tremendous speed at which organizations collect and process the data, particularly in a real-time streaming architecture.
- Variety. The diversity of data. The three main forms data takes include structured, semistructured and unstructured. The wide range of structurally different data -- which often requires different approaches -- brings new challenges.
Any of the three Vs traditional software technologies can't handle are considered big data.
Big data stack architecture layers
To address the challenges of big data, organizations must look beyond traditional data processing infrastructure. One area of recourse is special purpose big data software technologies. When used in concert, big data technologies can mitigate the effect of big data.
The following six layers are key to a successful big data stack architecture:
The first step of a big data stack architecture is data collection. Data acquisition can either push or pull from a wide range of internal and external data sources. Some examples of data sources include transactional systems, IoT devices, social media and static log files.
Big data ingestion software handles large static data sets, small real-time data sets and various data formats of each. The large data sets arrive slowly and the small sets arrive quickly. Postponing schema and quality validation to further along in the pipeline aids in higher throughput.
Once collected, raw data is typically stored as files in a data lake optimized for getting data into the analytics pipeline. Native format repositories are both landing zones for batch data and sandboxes for time sensitive exploratory inquiry.
Big data storage software stores both large and small files with various formats, which typically takes the form of a distributed file system such as object storage. Non-transient data can persist with long retention periods and requires software with automated tiering throughout the data's lifecycle.
Processing involves the preparation of batch data sets at rest and streaming data in motion for analysis. Data curation can include cleaning, conforming, enriching, integrating, filtering, aggregating and otherwise preparing data for analysis.
Big data processing software operates on large batches of data, with higher latency and more complex computation, which requires long-running high-efficiency computing. Using distributed processing software acting on smaller pieces of partitioned data can achieve this.
Big data processing software also operates on high-speed streaming data, with lower latency and relatively simple computation. Streaming data processing requires guaranteed durability, order and delivery achieved through a continuously available streaming service.
Achieve batch and stream performance through software parallelism, in-place processing and schema-on-read. Key big data stack tactics include the division of data and processing into small units with simultaneous execution and minimizing schema validation during analytic store loading.
4. Analytic stores
Analytic data stores process or refine data for analysis. Examples of data stores include SQL-based dimensional data warehouses, NoSQL technologies and distributed data stores with an abstraction layer for access to varying data types through interfaces.
Big data analytic stores support diverse storage approaches and technologies, referred to as polyglot persistence. Special purpose single-model databases deliver performance and scalability by optimizing data storage and processing a particular data type. Essential tactics include data processing, parallel execution and data partitioning.
Analysis examines analytic data stores and raw storage. Human users in an interactive setting use BI tools to gain insights through visualization. Advanced analytic tools crunch data to extract intelligence. Machine learning uses artificial intelligence to work directly on the data to teach itself.
Big data analysis software handles the inquiries from simple ad hoc queries to sophisticated predictive analytics and machine learning operations. The range of users includes casual analysts, data scientists and machines. Because data is often decentralized and in-place analysis is essential, software should present a unified view of the data ecosystem to users through virtualization of a data fabric.
Big data stacks often use workflow technology to manage data operations such as source data collection, raw data storage and data processing. Operations also include the movement of refined data to analytic data stores and the push of insights directly to business intelligence applications, such as reports and dashboards.
Big data orchestration software automates the data pipeline, which minimizes latency and achieves time to value. Workflow software provides an easy-to-use administrative interface and seamless integrations between architecture components.
Choosing a big data stack
Before choosing a big data technology or stack, organizations should quantify their current and future data challenges, understand the limitations of traditional software and note big data industry trends. They should periodically revisit their assessments as big data and technology evolution are moving targets.
It is important to ensure technology choices are modular and loosely coupled to allow for changes in a plug-and-play strategy with minimal or no effect on other stack software. Focus on software specifically designed to solve a unique challenge in architecture, not multipurpose software.
The checklist in Figure 1 is a good starting point to evaluate big data software.
Data-driven organizations understand handling big data is a core competency. Special purpose big data software can address the challenges of data at scale and complexity. Together with traditional data software, big data stacks help manage data and deliver timely business insights.
Jeff McCormick is a former enterprise architecture IT principal at a major health services company. He has worked in data-related IT roles for more than 30 years and holds AWS and Azure cloud certifications, and Oracle, Microsoft and Sybase database certifications.