Traditional data stacks lack the flexibility and scalability that cloud technology gives the modern data stack. However, on-premises data stacks can hold several benefits over their cloud counterparts.
A data stack is the set of platforms, tools and other technologies that enable organizations to collect, store and use their data. Traditionally, a data stack was on premises in a company's data center and made heavy use of relational databases or data warehouses.
The modern data stack uses cloud storage and advanced analytics tools to create a more flexible and scalable option that each organization can tailor to its specific needs.
While often associated with legacy technologies, on-premises stacks can add advanced analytics tools and should not be discounted simply due to the lack of cloud. Organizations should evaluate their needs and may find an on-premises data stack suits them better than a modern one.
How a data stack works
A data stack is like a supply chain, but for data instead of physical goods. Just like a physical supply chain, a data stack can involve several specialized tools, technologies and frameworks, said Bob Parr, chief data officer at KPMG US.
For example, a data stack can include tools that assess and remediate data quality and normalize data with common codes. It can also include tools that structure the data properly for storage, aggregation and distribution for analytics, reporting, visualizations and insight generation. No single vendor or set of services can cover all of these, Parr said.
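To make the supply chain analogy concrete, the stages Parr describes can be sketched as a small Python pipeline: assess quality, normalize values to common codes, then aggregate for reporting. This is only an illustration under stated assumptions. The record fields, the `COUNTRY_CODES` table and the function names are all hypothetical, and a real stack would hand each stage to a dedicated tool.

```python
from collections import defaultdict

# Hypothetical lookup that maps free-form country labels to common codes.
COUNTRY_CODES = {
    "united states": "US",
    "usa": "US",
    "germany": "DE",
    "deutschland": "DE",
}


def _country_key(record):
    """Return the cleaned, lowercase country label of a record ('' if missing)."""
    return (record.get("country") or "").strip().lower()


def assess_quality(records):
    """Split records into those that pass basic checks and those needing remediation."""
    valid, rejected = [], []
    for record in records:
        amount_ok = isinstance(record.get("amount"), (int, float))
        country_ok = _country_key(record) in COUNTRY_CODES
        (valid if amount_ok and country_ok else rejected).append(record)
    return valid, rejected


def normalize(records):
    """Replace country labels with common codes and store amounts as floats."""
    return [
        {"country": COUNTRY_CODES[_country_key(r)], "amount": float(r["amount"])}
        for r in records
    ]


def aggregate(records):
    """Total the amounts per country code, ready for reporting or visualization."""
    totals = defaultdict(float)
    for record in records:
        totals[record["country"]] += record["amount"]
    return dict(totals)


def run_pipeline(records):
    """Run quality assessment, normalization and aggregation in order."""
    valid, rejected = assess_quality(records)
    return aggregate(normalize(valid)), rejected
```

Calling `run_pipeline` on a list of dictionaries returns the per-country totals along with the rejected records, which a data quality tool would route to remediation rather than discard.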
Almost every organization has some form of data stack. Today, most are cloud-enabled. A typical example of a modern data stack could look like this:
- Azure Data Factory or AWS Glue for data ingestion.
- Informatica's Intelligent Data Management Cloud or AWS Glue DataBrew for data quality.
- Amazon S3, MongoDB Atlas or Azure Data Lake for data storage.
- Apache Hadoop, Apache Spark or Databricks for data processing or transformation.
- The Python programming language and its libraries, such as pandas and NumPy, or Dataiku for data analysis.
- Tableau or Power BI for data visualization.
Each of the above options offers a suite of cloud services to address most of an organization's needs, Parr said.
Cloud-based data stack provider benefits
When organizations have a deep relationship with a primary hyperscaler such as Microsoft, AWS or Google, they tend to align the rest of their data stack with that particular cloud provider, Parr said.
Going with a single cloud provider often involves tradeoffs. For example, the cloud provider's tools might be simpler to integrate and have more predictable cost structures. However, they may not offer best-in-class functionality for every component.
Benefits of a cloud-based data stack include scalability, increased accessibility, integrated analytics, machine learning capabilities, and reduced infrastructure and maintenance costs, Parr said.
Commercial data sources can also augment an organization's own data to improve analytics. For example, companies offer economic data, weather data, supply chain data, competitive benchmarks and more.
Do traditional data stacks have a place?
In the past, a traditional data stack was just a database, said Holger Mueller, vice president and principal analyst at Constellation Research. Over time, it grew to include file systems, as well as tools for data integration, quality, cleansing and deduplication. This evolution led to data warehouses and lakehouses.
Use of on-premises data stacks is declining, but they still offer some benefits. Traditional data stacks give administrators more control over the data infrastructure, and companies can tailor the stack to their own needs and security requirements, Mueller said.
"This control can be particularly important for companies that handle sensitive data and have strict compliance regulations," Parr said.
Companies that require real-time processing or have high throughput demands can run their own stack to maintain consistent performance levels, he said.
Using an on-premises data stack doesn't mean organizations have to use legacy technologies. Legacy generally refers to outdated tools and processes that lack scalability, flexibility and advanced features.
"They may require manual maintenance, have limited integration capabilities and struggle to handle large volumes of data or complex analytical tasks," Parr said.
A typical legacy data stack of this kind might use the following:
- SQL for data ingestion.
- Informatica Data Quality for data quality.
- Microsoft Access, Db2 or flat files for data storage.
- SAS or IBM SPSS for data processing and transformation.
- Microsoft Excel for data analysis.
- Excel and PowerPoint for data visualization.
Organizations using an on-premises data stack can upgrade it with advanced analytics tools. Data science teams use analytics and machine learning through on-premises BI and analytics systems.
"You'll ingest data using an on-premises data-integration system that handles extract, transform, load from a mainframe or enterprise planning system," said Doug Henschen, vice president and principal analyst at Constellation Research. Then an on-premises data warehouse or data lake platform, such as Hadoop or Databricks, manages the data.
Beyond control and security, these systems offer another advantage over cloud-based data stacks -- they can cost less. Cloud deployments make sense for variable workloads. When there's a need for more processing power for a short time, the cloud scales up easily and then scales back down when the demand is over. The cost depends on what an organization uses.
But companies with stable, predictable workloads are moving them back on premises, "particularly if they never gave up their data centers and moved entirely into the cloud," Henschen said.
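The cost argument can be illustrated with a rough calculation. The sketch below assumes a simplified model that is not from the sources quoted here: cloud capacity is billed per unit-hour actually used, while on-premises capacity must be provisioned for peak demand and paid for around the clock. The function names and all rates are hypothetical, and real comparisons would also weigh staffing, power, licensing and data transfer.

```python
def cloud_cost(hourly_demand, rate_per_unit_hour):
    """Pay-per-use model: billed only for the capacity consumed each hour."""
    return sum(hourly_demand) * rate_per_unit_hour


def on_prem_cost(hourly_demand, rate_per_provisioned_unit_hour):
    """Fixed model: capacity is sized for the peak and paid for every hour."""
    peak = max(hourly_demand)
    return peak * len(hourly_demand) * rate_per_provisioned_unit_hour


def cheaper_option(hourly_demand, cloud_rate, on_prem_rate):
    """Return which deployment costs less under this simplified model."""
    cloud = cloud_cost(hourly_demand, cloud_rate)
    on_prem = on_prem_cost(hourly_demand, on_prem_rate)
    return "cloud" if cloud < on_prem else "on-premises"


# Hypothetical workloads over one day, in capacity units per hour.
spiky_workload = [1] * 20 + [10] * 4   # short burst of heavy demand
steady_workload = [8] * 24             # stable, predictable demand
```

With an assumed cloud rate of 1.0 and an on-premises rate of 0.5 per unit-hour, the spiky workload costs 60 in the cloud against 120 on premises, while the steady workload costs 192 in the cloud against 96 on premises. The numbers are invented, but they show why a workload's shape, not only its size, drives the decision.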
Another use for traditional data stacks is for large, slow-moving data, said Bhrugu Pange, a managing director who leads the technology services group at AArete, a management consulting firm.
This kind of data stack typically uses a relational database such as Oracle, Microsoft SQL Server or PostgreSQL for data storage, he said. For extract, transform and load functions, it uses data integration tools such as Microsoft SSIS, Informatica or Talend. For data analysis and visualization, it adds tools such as Tableau, Qlik or Power BI.
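A batch extract, transform and load job of this kind can be sketched in a few lines of Python. The example is a stand-in rather than a depiction of any of the products named above: it uses the standard library's `sqlite3` module in place of a production relational database, and the `orders` table, its columns and the sample CSV are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extract, standing in for a file exported by another system.
RAW_CSV = """order_id,region,amount
1,east,100.50
2,west,200.00
3,east,49.50
"""


def extract(csv_text):
    """Read raw rows from CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Cast types and standardize region labels before loading."""
    return [
        (int(row["order_id"]), row["region"].strip().upper(), float(row["amount"]))
        for row in rows
    ]


def load(conn, rows):
    """Create the target table if needed and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def totals_by_region(conn):
    """Run the kind of aggregate query a BI tool would issue against the database."""
    query = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    return conn.execute(query).fetchall()


def run_batch(csv_text):
    """Run one batch: extract, transform, load, then report totals."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(csv_text)))
    return totals_by_region(conn)
```

Calling `run_batch(RAW_CSV)` loads the three sample orders and returns the total amount per region. The job processes one complete extract at a time, which is the batch pattern described here rather than a streaming one.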
"While this stack can support large volumes, it will typically not support high-frequency, real-time, or streaming, analytics and event processing," he added.