Big data storage is a compute-and-storage architecture that collects and manages large data sets and enables real-time data analytics.
Companies apply big data analytics to get greater intelligence from metadata. In most cases, big data storage uses low-cost hard disk drives, although moderating prices for flash appear to have opened the door to using flash in servers and storage systems as the foundation of big data storage. These systems can be all-flash or hybrids mixing disk and flash storage.
The data itself in big data is unstructured, which means mostly file-based and object storage.
Although a specific volume size or capacity is not formally defined, big data storage usually refers to volumes that grow exponentially to terabyte or petabyte scale.
The big promise behind big data
Several factors have fueled the rise of big data. People now keep more information than ever before due to widespread digitization of paper records among businesses. The proliferation of sensor-based Internet of Things (IoT) devices has led to a corresponding rise in the number of applications based on artificial intelligence (AI), which is an enabling technology for machine learning. These devices generate their own data without human intervention.
A misconception about big data is that the term refers solely to the size of the data set. Although this is true in the main, the science behind big data is more focused. The intention is to mine specific subsets of data from multiple, large storage volumes. This data may be widely dispersed in different systems and may not have an obvious correlation. The objective is to unify the data with structure and intelligence to allow it to be rapidly analyzed.
The ability to collect different data from various sources, and place those associations in an understandable context, allows an organization to glean details that are not readily apparent otherwise. The analysis is used to inform decision-making, such as examining online browsing behavior to tailor products and services to a customer's habits or preferences.
Big data analytics has paved the way for DevOps organizations to emerge as a strategic analytics arm within many enterprises. Companies in finance, health care and energy need to analyze data to pinpoint trends and improve business functions. In the past, businesses were limited to using a data warehouse or high-performance computing (HPC) cluster to parallelize batch processing of structured data, a process that could take days or perhaps weeks to complete.
By contrast, big data analytics processes large semi-structured or unstructured data and streams the results within seconds. Google and Facebook exploit fast big data storage to serve targeted advertising to users as they surf the Internet, for example. A data warehouse or HPC cluster may be used separately as an adjunct to a big data system.
Analyst firm IDC estimates that the market for big data storage hardware, services and software will generate $151 billion in 2017 and grow at a compound annual rate of nearly 12% through 2020, when revenue is pegged to hit $210 billion.
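The forecast figures can be sanity-checked with the standard compound annual growth rate formula. The following is a minimal sketch that assumes three compounding years between 2017 and 2020; that span is an interpretation of the forecast, not something the IDC estimate spells out, and the function name is illustrative.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Return the constant yearly growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1

# Revenue figures quoted in the text, in billions of dollars.
# The three-year span (2017 to 2020) is an assumption.
cagr = compound_annual_growth_rate(151, 210, years=3)
print(f"{cagr:.1%}")  # prints 11.6%
```

Under that assumption the rate works out to roughly 11.6%, which is consistent with the "nearly 12%" figure.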
Big data storage demand will approach 163 zettabytes by 2025, according to a separate 2017 report by IDC and Seagate. The report attributes the growth to increased use of cognitive computing, embedded systems, machine learning, mobile devices and security.
The components of big data storage infrastructure
A big data storage system clusters a large number of commodity servers attached to high-capacity disk to support analytic software written to crunch vast quantities of data. The system relies on massively parallel processing (MPP) databases to analyze data ingested from a variety of sources.
Big data often lacks structure and comes from various sources, making it a poor fit for processing with a relational database. Apache Hadoop, built on the Hadoop Distributed File System (HDFS), is the most prevalent analytics platform for big data, and is typically combined with some flavor of a NoSQL database.
Ben Woo, managing editor of Neuralytix Inc., discusses Hadoop and storage for big data projects.
Hadoop is open source software written in the Java programming language. HDFS spreads the data analytics across hundreds or even thousands of server nodes without a performance hit. Through its MapReduce component, Hadoop distributes processing in this way as a safeguard against catastrophic failure. The multiple nodes serve as a platform for data analysis at a network's edge. When a query arrives, MapReduce executes processing directly on the storage node on which data resides. Once analysis is completed, MapReduce gathers the collective results from each server and “reduces” them to present a single cohesive response.
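The map, gather and reduce steps described above can be illustrated with a small single-process sketch. It does not use Hadoop's actual API; the node names, log lines and function names are hypothetical, and each list simply stands in for the data held on one storage node.

```python
from collections import defaultdict

# Hypothetical log lines; each list stands in for data held on one storage node.
node_data = {
    "node-1": ["error disk full", "info job started"],
    "node-2": ["error network timeout", "info job finished"],
    "node-3": ["error disk full"],
}

def map_phase(lines):
    """Runs where the data lives: emit (severity, 1) for each log line."""
    return [(line.split()[0], 1) for line in lines]

def shuffle(mapped_outputs):
    """Gather the intermediate pairs from every node and group them by key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Collapse each key's values into a single combined result."""
    return {key: sum(values) for key, values in grouped.items()}

# Map on each node independently, then reduce to one cohesive response.
mapped = [map_phase(lines) for lines in node_data.values()]
result = reduce_phase(shuffle(mapped))
print(result)  # {'error': 3, 'info': 2}
```

In a real cluster the map step would run in parallel on separate machines and the framework would handle scheduling and node failures; the sketch only shows the data flow.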
How big data storage compares to traditional enterprise storage
Big data can bring an organization a competitive advantage from large-scale statistical analysis of the data or its metadata. In a big data environment, the analytics mostly operate on a circumscribed set of data, using a series of data mining-based predictive modeling forecasts to gauge customer behaviors or the likelihood of future events.
Statistical big data analysis and modeling is gaining adoption in a cross-section of industries, including aerospace, environmental science, energy exploration, financial markets, genomics, healthcare and retailing. A big data platform is built for much greater scale, speed and performance than traditional enterprise storage. Also, in most cases, big data storage targets a much more limited set of workloads on which it operates.
For example, your enterprise resource planning systems might be attached to a dedicated storage area network (SAN). Meanwhile, your clustered network-attached storage (NAS) supports transactional databases and corporate sales data, while a private cloud handles on-premises archiving.
It's not uncommon for larger organizations to have multiple SAN and NAS environments that support discrete workloads. Each enterprise storage silo may contain pieces of data that pertain to your big data project.
Antony Adshead, a site editor at Computer Weekly, discusses what defines big data and the key attributes required of big data storage.
Legacy storage systems handle a broader number of application workloads. The generally accepted industry practice in primary storage is to assign an individual service level to each application to govern availability, backup policies, data access, performance and security. Storage used for production -- the activities a company uses daily to generate revenue -- demands high uptime, whereas big data storage projects can tolerate higher latency.
The three Vs of big data storage technologies
Storage for big data is designed to collect voluminous data produced at variable speeds by multiple sources and in varied formats. Industry experts describe this process as the three Vs: the variety, velocity and volume of data.
Variety describes the different sources and types of data to be mined. Sources include audio files, documents, email, file storage, images, log data, social media posts, streaming video and user clickstreams.
Velocity pertains to the speed at which storage is able to ingest big data volumes and run analytic operations against them. Volume acknowledges that modern application data sets are large and growing larger, outstripping the capabilities of existing legacy storage.
Some experts suggest big data storage needs to encompass a fourth V: veracity. This involves ensuring that the data sources being mined are verifiably trustworthy. A major pitfall of big data analytics is that errors tend to be compounded, through corruption, user error or other causes. Veracity may be the most important element and the toughest issue to solve, in many cases possible only after a thorough data cleansing of databases.
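As a rough illustration of what a data cleansing pass might involve, the sketch below drops duplicate, missing and implausible readings before analysis. The record layout, field names and valid temperature range are all assumptions chosen for the example, not part of any particular product.

```python
# Hypothetical sensor records; field names and the valid range are assumptions.
records = [
    {"sensor_id": "a1", "temperature_c": 21.5},
    {"sensor_id": "a1", "temperature_c": 21.5},   # exact duplicate
    {"sensor_id": "b2", "temperature_c": None},   # missing reading
    {"sensor_id": "c3", "temperature_c": 999.0},  # physically implausible
    {"sensor_id": "d4", "temperature_c": 19.0},
]

def cleanse(rows, low=-50.0, high=60.0):
    """Keep the first copy of each record whose reading is present and in range."""
    seen = set()
    clean = []
    for row in rows:
        value = row["temperature_c"]
        key = (row["sensor_id"], value)
        if value is None or not (low <= value <= high) or key in seen:
            continue  # skip missing, out-of-range or duplicate records
        seen.add(key)
        clean.append(row)
    return clean

clean_records = cleanse(records)
print([row["sensor_id"] for row in clean_records])  # ['a1', 'd4']
```

Real cleansing pipelines typically also log what was rejected and why, so that the errors can be traced back to their source.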
How machine learning affects big data storage
Machine learning is a branch of AI whose rising prominence mirrors that of big data analytics. Trillions of data points are generated each day by AI-based sensors embedded in IoT devices ranging from automobiles to oil wells to refrigerators.
In machine learning, a computing device produces analysis without human intervention. Iterative statistical analytics models apply a series of mathematical formulas. With each computation, the machine learns different pieces of intelligence that it uses to fine-tune the results.
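One common form of this iterative fine-tuning is gradient descent, sketched below for a one-parameter model. The data points, learning rate and iteration count are made up for illustration, and gradient descent is only one of many techniques the description could refer to.

```python
# Hypothetical observations that follow y = 2 * x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mean_squared_error(weight):
    """Average squared gap between the model's predictions and the observations."""
    return sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

weight = 0.0          # initial guess for the model y = weight * x
learning_rate = 0.01  # assumed step size
loss_history = []

for _ in range(200):
    loss_history.append(mean_squared_error(weight))
    # Gradient of the mean squared error with respect to the weight.
    gradient = sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Each pass nudges the weight in the direction that reduces the error.
    weight -= learning_rate * gradient

print(round(weight, 6))  # 2.0
```

Each pass uses the error from the previous estimate to adjust the model, so the loss shrinks as the iterations accumulate.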
The theory of machine learning is that the analysis will grow more reliable over time. Google's self-driving car is an example of machine learning in the corporate world, but consumers use it when they click on a recommended streaming video or receive a fraud-detection alert from their bank.
Most machine data exists in an unstructured format. Human intellect alone is not capable of rendering this data in context. Making sense of it requires massively scalable, high-performance storage, overlaid with powerful software intelligence that imposes structure on the raw data and extracts it in a way that is easy to digest.
Building custom big data storage
Big data storage architecture generally falls into one of two camps: geographically distributed server nodes -- the Hadoop model -- or scale-out NAS or object storage systems. Each has its advantages and disadvantages. Depending on the nature of the big data storage requirements, a blend of several systems might be used to build out your infrastructure.
Hyper-scale cloud providers often design big data storage architecture around mega-sized server clusters using direct-attached storage. In this arrangement, PCI Express flash might be placed in the server for performance and surrounded with just a bunch of disks (JBOD) to control storage costs. This type of architecture is geared for workloads that seek and open hundreds of small files. The setup also comes with limitations, such as the inability to deliver shared storage among users and the need to add third-party data management services.
As big data implementations have started to mature, the prospect of a do-it-yourself big data storage platform is not as daunting as it once was, although it is not a task to undertake lightly. It requires taking stock of internal IT to determine if it makes sense to build a big data storage environment.
An enterprise IT staff will shoulder the burden of drawing up hardware specs and building the system from scratch, including sourcing components, testing and managing the overall deployment. Hadoop integration can be challenging for enterprises used to relational database platforms.
Big data storage also requires a development team to write code for in-house analytics software or to integrate third-party software. Enterprises also need to weigh the cost-to-benefit ratio of creating a system for a limited set of applications, compared to enterprise data storage that handles a more diverse range of primary workloads.
Buying big data storage: scale-out NAS, object storage
Clustered and scale-out NAS provides shared access to parallel file-based storage. Data is distributed among many storage nodes and capacity scales to billions of files, with independent scaling of compute and storage. NAS is recommended for big data jobs involving large files. Most NAS vendors provide automated tiering for data redundancy and to lower the cost per gigabyte.
DataDirect Networks SFA, Dell EMC Isilon, NetApp FAS, Qumulo QC and Panasas ActiveStor are among the leading scale-out NAS arrays.
Similar to scale-out NAS, object storage archive systems can extend to support potentially billions of files. Instead of a file tree, an object storage system attaches a unique identifier to each file. The objects are presented as a single managed system within a flat address space.
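The difference from a file tree can be shown with a toy in-memory model. The class and method names below are hypothetical and do not correspond to any vendor's API; the point is only that objects are addressed by a unique identifier in one flat namespace rather than by a directory path.

```python
import uuid

class FlatObjectStore:
    """Toy in-memory object store: one flat namespace, no directory tree."""

    def __init__(self):
        self._objects = {}  # object ID -> (data, metadata)

    def put(self, data, **metadata):
        """Store the data and return the unique identifier assigned to it."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (data, metadata)
        return object_id

    def get(self, object_id):
        """Look the object up directly by ID; no path traversal is involved."""
        return self._objects[object_id]

store = FlatObjectStore()
object_id = store.put(b"sensor readings", source="oil-well-17", content_type="text/csv")
data, metadata = store.get(object_id)
print(metadata["source"])  # oil-well-17
```

Because every lookup is a single key access, the namespace can grow without the deep directory hierarchies that slow down traditional file systems at very large file counts.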
Most legacy block storage and file storage vendors have added object storage to their portfolios. A number of newer vendors sell object-based storage with native file support.
A data lake is sometimes considered an outgrowth of object storage, although critics deride the term as a marketing ploy. Frequently associated with Hadoop-based big data storage, a data lake streamlines the management of non-relational data scattered among numerous Hadoop clusters.
Big data storage security strategies
Big data storage raises several security issues, and no single solution addresses all of them. NoSQL databases are hailed for the ability to ingest, manage and quickly process a flood of complex data. Aside from some inherent security basics, NoSQL security is not as robust as that of more mature relational databases, underscoring the need to surround a big data project with data management and data protection tools for critical data.
Owing to machine learning, big data projects often unearth information from metadata that is not obvious from examining the source data. This could result in the inadvertent exposure of personally identifiable information of customers, partners or others. Stronger access controls, network encryption and perimeter security each play a role, along with scheduled periodic checks to ensure sustained data integrity.
Big data storage likely will necessitate an ongoing audit of internal data governance, in addition to compulsory regulatory compliance. Emerging technologies such as homomorphic encryption -- to date associated mostly with data security for public cloud services -- could be another piece in solving the puzzle of big data storage security.
Whether the choice is made to build or buy, the success of a big data storage project hinges on choosing relevant data for analysis. A predictive model can quickly go off the rails if confirmation bias or other errors influence which data is selected. The onus should be placed on developing the most accurate data models possible to avoid a “garbage in, garbage out” syndrome.