IT shops continue to be challenged by the unbridled growth of their organizations' data stores. Data managers and big data specialists need to capture, process and present an ever-increasing amount of data to end users for both operational and analytics applications.
For this reason, many organizations are moving their data management infrastructure to the cloud, which offers the benefits of scalability and managed services. But with so many options from different vendors available, it's important to know what you're getting into before making the database cloud migration leap.
Data volumes that were rare a few years ago are now commonplace. Day-to-day operational systems are storing such large amounts of data that they rival data warehouses in disk storage and processing complexity. Advancements in IoT, machine learning and artificial intelligence technologies are also contributing to an astonishing amount of data generation.
An IDC study sponsored by storage vendor Seagate states that the total volume of data we store globally will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025. The study also says that almost 30% of the world's data will require real-time processing, and IDC estimates in a separate report that the IoT segment alone will create 79.4 zettabytes of new data in 2025.
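As a rough check on those figures, the implied yearly growth rate can be worked out directly. The sketch below assumes smooth compounding between the two endpoints, which the study itself does not claim, and the function name is purely illustrative.

```python
def implied_annual_growth(start, end, years):
    """Compound annual growth rate implied by two endpoint values."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end and years must be positive")
    return (end / start) ** (1 / years) - 1


# Figures cited above: 33 ZB in 2018 and 175 ZB in 2025.
# Assumption: growth compounds evenly across the seven years.
rate = implied_annual_growth(33, 175, 2025 - 2018)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption, global data volume grows by roughly 27% a year, which means it doubles about every three years.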
Vendors of all sizes are trying to capitalize on this rapid growth by offering a wealth of big data products to the IT community. IDC also forecasts that the combined big data and business analytics market will experience double-digit annual growth for at least the next few years, reaching $274 billion in revenue by 2022. Evaluating the seemingly endless array of cloud data pipeline offerings can be challenging for even the savviest big data experts.
Cloud migration questions to ask
Shops weighing an on-premises-to-cloud big data pipeline migration need to begin their analysis by answering the following fundamental questions about their planned cloud database and big data architecture. Doing so will help them plan the move and avoid common migration mistakes.
- Do we migrate our system as it currently exists, or should we rethink the architecture?
- Is our current architecture meeting our needs?
- Is it flexible enough to satisfy our organization's future big data analytics requirements?
- Will our current system allow us to easily leverage new data pipeline products and advancements in big data technologies?
- Do we design, install and administer our own architecture, or do we rent the platform from a cloud provider?
- How will transferring data to the cloud affect our environment?
- What modifications will we need to make to our current systems and network architecture to enable high-performance data transfers to the cloud? (The analysis should include separate evaluations for batch and streaming data transfers.)
- How will the cloud impact our adherence to governmental, industry-specific or organizational regulatory compliance frameworks like PCI, HIPAA, NIST, NERC and GDPR?
- What additional costs will we incur in the cloud? (Examples include staff training, organizational changes and infrastructure enhancements, as well as possible increases in processing costs.)
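The data transfer questions above lend themselves to a quick back-of-the-envelope estimate before any detailed network analysis. The following is a minimal sketch, not a sizing tool: the function names, the 70% link-efficiency default and the sample workload figures are all assumptions chosen for illustration, and real throughput depends on protocol overhead, latency and competing traffic.

```python
def batch_transfer_hours(data_tb, link_gbps, efficiency=0.7):
    """Hours needed to move data_tb terabytes over a link_gbps link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Gbps = 10**9 bits/s).
    efficiency is the assumed usable fraction of nominal bandwidth.
    """
    if data_tb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid data size, link speed or efficiency")
    total_bits = data_tb * 8e12
    usable_bits_per_sec = link_gbps * 1e9 * efficiency
    return total_bits / usable_bits_per_sec / 3600


def streaming_mbps(events_per_sec, avg_event_bytes):
    """Sustained bandwidth in Mbps that a steady event stream requires."""
    if events_per_sec < 0 or avg_event_bytes < 0:
        raise ValueError("rates and sizes must be non-negative")
    return events_per_sec * avg_event_bytes * 8 / 1e6


# Hypothetical workloads: a 100 TB initial batch load over a 1 Gbps link,
# and a stream of 50,000 events per second averaging 1,000 bytes each.
batch_hours = batch_transfer_hours(100, 1)
stream_mbps = streaming_mbps(50_000, 1_000)
print(f"Batch load: {batch_hours:.0f} hours ({batch_hours / 24:.1f} days)")
print(f"Streaming: {stream_mbps:.0f} Mbps sustained")
```

With these sample numbers, the initial batch load would occupy the link for about two weeks, while the stream alone would consume 400 Mbps of sustained bandwidth. That is why batch and streaming transfers are worth evaluating separately.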
Cloud data pipeline platform offerings
To help you weigh your database cloud migration options, let's review the database and big data offerings of the top three super-sized cloud platform providers. This will be only a brief introduction to the data pipeline products that are available in the cloud, though -- your technology evaluation may also include many others.
Microsoft's Azure HDInsight provides a robust suite of Apache open source technologies, including Hadoop, Spark, Hive LLAP, Kafka, HBase and Storm. In addition, the HDInsight platform offers Microsoft's ML Services for R-based analytics and machine learning applications.
Microsoft also offers the Azure Data Lake repository and analytics platform, plus tools such as Azure Event Hubs for real-time streaming data ingestion and Azure Databricks for Spark-based analytics. Its lineup of cloud databases includes Azure SQL Database, a cloud-based offshoot of the company's flagship SQL Server relational database, as well as the Azure Cosmos DB multi-model database and the Azure Synapse Analytics data warehouse.
AWS offers a wide range of big data storage and data pipeline products in its cloud. Its primary big data platform is Amazon EMR, which provides customers with access to Amazon EC2 compute resources and S3 object storage, as well as the Hadoop Distributed File System (HDFS) and connections to the Amazon DynamoDB key-value database. Amazon EMR includes big data tools such as Spark, Hive, HBase, Flink and Hudi.
There's also the Amazon Redshift data warehouse, the Amazon Kinesis real-time data streaming system and a variety of database engines, including the Amazon Aurora relational software; NoSQL systems like DynamoDB, Amazon DocumentDB and the Amazon Neptune graph database; and specialized in-memory, time series and ledger database services.
Google also offers a robust big data product suite that covers every aspect of big data management and analytics for users of its cloud platform. Its products include Cloud Dataflow for processing both streaming and batch data sets, Cloud Pub/Sub for event data ingestion and Cloud Dataproc, which is a fully managed Apache Hadoop and Spark service.
Also available from Google are the BigQuery data warehouse and a set of database technologies that includes the Cloud SQL and Cloud Spanner relational database services and the Cloud Bigtable and Cloud Firestore NoSQL engines.
Beyond the three providers above, popular competitors include big data platform vendor Cloudera as well as IBM, Oracle and Qubole.
Evaluating data pipeline platforms
Organizations should follow a standardized evaluation methodology to facilitate the process of selecting a data pipeline product. Evaluation best practices include choosing an appropriate evaluation team, performing a thorough needs analysis and creating a robust set of weighted evaluation metrics.
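A weighted set of evaluation metrics can be as simple as a scoring matrix. The sketch below shows one common way to build one; the criteria, weights, 1-to-5 scores and platform names are all hypothetical placeholders, not assessments of any real product.

```python
def weighted_score(weights, scores):
    """Weighted average of per-criterion scores, normalized by total weight."""
    if set(weights) != set(scores):
        raise ValueError("weights and scores must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def rank_platforms(weights, candidates):
    """Return (name, score) pairs sorted from best to worst."""
    scored = [(name, weighted_score(weights, s)) for name, s in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical criteria and weights agreed on by the evaluation team.
weights = {"scalability": 0.30, "compliance": 0.25, "cost": 0.25, "ecosystem_fit": 0.20}

# Hypothetical 1-5 scores from the needs analysis.
candidates = {
    "Platform A": {"scalability": 4, "compliance": 3, "cost": 5, "ecosystem_fit": 4},
    "Platform B": {"scalability": 5, "compliance": 4, "cost": 2, "ecosystem_fit": 3},
}

ranking = rank_platforms(weights, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this made-up example, Platform A ranks first despite lower scalability and compliance scores because cost carries a quarter of the weight. Agreeing on the weights before scoring any product helps keep the comparison from being tuned toward a favorite.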
Enterprises should also educate themselves on all of the available Hadoop distributions. Each offers distinct features and options that provide customizations best suited for different compute environments.
Organizations now have more database cloud migration options available to them than ever before. To correctly design and implement the most appropriate data pipeline architecture for their organization, big data specialists and other data management professionals must evaluate and compare large data store ecosystems as well as individual products.