Big data management is the organization, administration and governance of large volumes of both structured and unstructured data.
The goal of big data management is to ensure a high level of data quality and accessibility for business intelligence and big data analytics applications. Corporations, government agencies and other organizations employ big data management strategies to help them contend with fast-growing pools of data, typically involving many terabytes or even petabytes stored in a variety of file formats. Effective big data management particularly helps companies locate valuable information in large sets of unstructured and semistructured data from various sources, including call detail records, system logs, sensors, images and social media sites.
Most big data environments go beyond relational databases and traditional data warehouse platforms to incorporate technologies that are suited to processing and storing nontransactional forms of data. The increasing focus on collecting and analyzing big data is shaping new data platforms and architectures that often combine data warehouses with big data systems.
As part of the big data management process, companies must decide what data must be kept for compliance reasons, what data can be disposed of and what data should be analyzed in order to improve current business processes or provide a competitive advantage. This process requires careful data classification so that, ultimately, smaller sets of data can be analyzed quickly and productively.
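The classification step described above can be sketched as a simple rule-based tagger. This is a minimal illustration, not a real policy engine: the retention categories, keyword rules and data set names below are all hypothetical assumptions.

```python
# Minimal sketch of rule-based data classification for retention decisions.
# Categories, keywords and data set names are hypothetical examples.

RETENTION_RULES = [
    ("compliance", ["invoice", "contract", "audit_log"]),      # must be kept
    ("analytics", ["clickstream", "sensor", "call_detail"]),   # keep for analysis
]

def classify_dataset(name: str) -> str:
    """Assign a data set to a retention category based on simple keyword rules."""
    for category, keywords in RETENTION_RULES:
        if any(keyword in name for keyword in keywords):
            return category
    return "disposable"  # nothing matched: candidate for disposal

datasets = ["2023_invoice_records", "web_clickstream_raw", "tmp_debug_dump"]
labels = {d: classify_dataset(d) for d in datasets}
# labels == {"2023_invoice_records": "compliance",
#            "web_clickstream_raw": "analytics",
#            "tmp_debug_dump": "disposable"}
```

In practice, classification like this is what lets teams carve large data pools into the smaller, well-labeled sets that can be analyzed quickly.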
Top challenges in managing big data
Big data is usually complex -- in addition to its volume and variety, it often includes streaming data and other types of data that are created and updated at a high velocity. As a result, processing and managing big data are complicated tasks. For data management teams, the biggest challenges in big data deployments include the following:
- Dealing with the large amounts of data. Sets of big data don't necessarily need to be large, but they commonly are -- and in many cases, they're massive. Also, data frequently is spread across different processing platforms and storage repositories. The scale of the data volumes that typically are involved makes it difficult to manage all of the data effectively.
- Fixing data quality problems. Big data environments often include raw data that hasn't been cleansed yet, including data from different source systems that may not be entered or formatted consistently. That makes data quality management a challenge for teams, which need to identify and fix data errors, variances, duplicate entries and other issues in data sets.
- Integrating different data sets. Similar to the challenge of managing data quality, the data integration process with big data is complicated by the need to pull together data from various sources for analytics uses. In addition, traditional extract, transform and load (ETL) integration approaches often aren't suited to big data because of its variety and processing velocity.
- Preparing data for analytics applications. Data preparation for advanced analytics can be a lengthy process, and big data makes it even more challenging. Raw data sets often must be consolidated, filtered, organized and validated on the fly for individual applications. The distributed nature of big data systems also complicates efforts to gather the required data.
- Ensuring that big data systems can scale as needed. Big data workloads require a lot of processing and storage resources. That can strain the performance of big data systems if they aren't designed to deliver the required processing capacity. It's a balancing act, though: Deploying systems with excess capacity adds unnecessary costs for businesses.
- Governing sets of big data. Without sufficient data governance oversight, data from different sources might not be harmonized, and sensitive data might be collected and used improperly. But governing big data environments creates new challenges because of the unstructured and semistructured data they contain, plus the frequent inclusion of external data sources.
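The data quality challenge above -- inconsistent formatting across source systems and duplicate entries -- can be sketched in a few lines. The field names and normalization rules here are illustrative assumptions, not a prescribed cleansing standard.

```python
# Minimal sketch of a data quality step: normalize inconsistent formatting
# from different source systems, then remove duplicate entries.
# Field names and formats are illustrative assumptions.

def normalize(record: dict) -> dict:
    """Standardize casing and whitespace so records from different sources match."""
    return {
        "email": record["email"].strip().lower(),
        "country": record["country"].strip().upper(),
    }

def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized record."""
    seen, unique = set(), []
    for record in map(normalize, records):
        key = (record["email"], record["country"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

raw = [
    {"email": "Ana@Example.com ", "country": "us"},
    {"email": "ana@example.com", "country": "US"},  # duplicate after normalization
    {"email": "bo@example.com", "country": "de"},
]
clean = deduplicate(raw)
# len(clean) == 2: the two "ana" records collapse into one
```

Note that the duplicate is only detectable after normalization -- which is why raw data from multiple source systems makes deduplication harder than it looks.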
Best practices for big data management
Done well, big data management sets the stage for successful analytics initiatives that can help drive better business decision-making and strategic planning in organizations. Here's a list of best practices to adopt in big data programs to put them on the right track:
- Develop a detailed strategy and roadmap upfront. Organizations should start by creating a strategic plan for big data that defines business goals, assesses data requirements and maps out applications and system deployments. The strategy should also include a review of data management processes and skills to identify any gaps that need to be filled.
- Design and implement a solid architecture. A well-designed big data architecture includes various layers of systems and tools that support data management activities, from ingestion, processing and storage to data quality, integration and preparation work.
- Stay focused on business goals and needs. Data management teams must work closely with data scientists, other analysts and business users to make sure that big data environments meet business needs for information to enable more data-driven decisions.
- Eliminate disconnected data silos. To avoid data integration problems and ensure that relevant data is accessible for analysis, a big data architecture should be designed without siloed systems. Such an architecture can also connect existing data silos as source systems so their data can be combined with other data sets.
- Be flexible on managing data. Data scientists commonly need to customize how they manipulate data for machine learning, predictive analytics and other types of big data analytics applications -- and in some cases, they want to analyze full sets of raw data. That makes an iterative approach to data management and preparation essential.
- Put strong access and governance controls in place. While governing big data is a challenge, it's a must, along with robust user access controls and data security protections. That's partly to help organizations comply with data privacy laws regulating the collection and use of personal data, but well-governed data can also lead to higher-quality and more accurate analytics.
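One common governance control mentioned above -- protecting personal data before it reaches analytics users -- can be sketched as a field-masking step. The list of sensitive fields and the hash-token scheme below are assumptions for illustration, not a compliance recommendation.

```python
# Minimal sketch of a governance control: masking sensitive fields before a
# record is exposed for analytics. The field list and masking scheme are
# hypothetical, not a compliance recommendation.

import hashlib

SENSITIVE_FIELDS = {"ssn", "phone"}  # hypothetical policy

def mask_value(value: str) -> str:
    """Replace a sensitive value with a stable one-way hash token."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def apply_policy(record: dict) -> dict:
    """Mask sensitive fields; pass everything else through unchanged."""
    return {k: mask_value(v) if k in SENSITIVE_FIELDS else v
            for k, v in record.items()}

record = {"user_id": "u42", "ssn": "123-45-6789", "region": "EMEA"}
safe = apply_policy(record)
# safe["ssn"] is now an opaque token; nonsensitive fields are untouched
```

Because the token is deterministic, the masked field can still be used to join or group records in analytics without exposing the underlying value.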
Big data management tools and capabilities
There's a wide variety of platforms and tools for managing big data, with both open source and commercial versions available for many of them. The list of big data technologies that can be deployed, often in combination with one another, includes distributed processing frameworks Hadoop and Spark; stream processing engines; cloud object storage services; cluster management software; NoSQL databases; data lake and data warehouse platforms; and SQL query engines.
To enable easier scalability and more flexibility on deployments, big data workloads increasingly are being run in the cloud, where businesses can set up their own systems or use managed services offerings. Prominent big data management vendors include cloud platform market leaders AWS, Google and Microsoft, plus Cloudera, Databricks and others that focus mainly on big data applications.
Mainstream data management tools are also key components for managing big data. That includes data integration software supporting multiple integration techniques, such as traditional ETL processes; an alternative ELT approach that loads data as is into big data systems so it can be transformed later as needed; and real-time integration methods, such as change data capture. Data quality tools that automate data profiling, cleansing and validation are commonly used, too.
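The ETL-versus-ELT distinction above comes down to where and when the transformation happens. The sketch below illustrates that difference with a toy transformation; the record fields are invented, and a plain list stands in for a real target system.

```python
# Minimal sketch contrasting ETL and ELT: ETL transforms records before
# loading, while ELT loads raw data as is and transforms it later in place.
# A list stands in for the warehouse; fields are illustrative.

def transform(record: dict) -> dict:
    """Example transformation: derive a standardized amount in cents."""
    return {**record, "amount_cents": int(round(record["amount"] * 100))}

def etl(source: list[dict]) -> list[dict]:
    """Extract-transform-load: data is cleaned before it lands in the target."""
    return [transform(r) for r in source]

def elt(source: list[dict]) -> list[dict]:
    """Extract-load-transform: raw data lands first, transformation is deferred."""
    warehouse = list(source)                  # load as is
    return [transform(r) for r in warehouse]  # transform later, as needed

orders = [{"order_id": 1, "amount": 19.99}]
# Both approaches yield the same result here; the difference is timing and
# where the transformation work runs, which matters at big data scale.
```

ELT shifts the transformation workload onto the big data platform itself, which is why it suits systems built to process large volumes of varied, fast-arriving data.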