
Modern architecture, high-quality data key to AI development

An expert explains how components such as a shared foundation and protocols that ensure quality help foster real-time insights and AI-driven analysis.

A modern data architecture is the catalyst for AI-driven insight generation.

Centralized data oversight and isolated data systems were not only obsolete but also hindrances to analysis long before OpenAI's November 2022 launch of ChatGPT sparked surging interest in AI development. They prevented even the possibility of deriving timely insights from data.

Now, they are impediments both to the real-time analysis needed to compete in the current business climate and to leveraging AI's potential to make workers better informed and more efficient.

A modern data architecture, meanwhile, enables both, fueling real-time analysis as well as facilitating AI development.

At the core of any reliable analytics or AI application is high-quality data. Therefore, any modern data architecture -- especially as organizations trust more of their operations to AI -- must include tools that address data quality.

More than that, however, a modern data architecture relies on a shared data foundation that connects an organization's data and makes it easily accessible.

Here, Ganapathy 'G2' Krishnamoorthy, vice president and general manager of database services at tech giant AWS, talks about why a modern data architecture is needed and what is included in one. He also discusses how to ensure data quality, the costs and benefits of investing in a proper infrastructure for AI and some of the barriers organizations face when attempting to modernize.

Editor's note: This Q&A has been edited for clarity and conciseness.

Why has data quality become more critical in the past few years?


Ganapathy 'G2' Krishnamoorthy: Good decisions require good data, and if you look at all the efforts to unlock the value of data -- this was true whether you were trying to do self-service BI or machine learning, and now with AI -- the quality of the outcomes is a function of the quality of the data.

You want to make sure that AI applications, large language model [LLM] tools and agents are working with your data and that the data is trustworthy. It is really important for LLMs to work with your best data. This is true for retrieval-augmented generation and other processes.
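As a rough illustration of the retrieval-augmented generation pattern Krishnamoorthy mentions, the sketch below retrieves the most relevant trusted documents for a question and grounds the LLM prompt in them. The embed and call_llm functions are placeholders for whatever embedding model and LLM endpoint an organization actually uses, not any specific AWS API.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant trusted documents and ground the LLM prompt in them.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a vector for the text from your embedding model."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder: send the grounded prompt to your LLM of choice."""
    raise NotImplementedError

def retrieve(question: str, documents: list[str], k: int = 3) -> list[str]:
    """Rank curated documents by cosine similarity to the question."""
    q = embed(question)
    scored = []
    for doc in documents:
        d = embed(doc)
        score = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
        scored.append((score, doc))
    return [doc for _, doc in sorted(scored, reverse=True)[:k]]

def answer(question: str, documents: list[str]) -> str:
    # Quality of the answer depends directly on the quality of these documents.
    context = "\n".join(retrieve(question, documents))
    prompt = f"Answer using only this trusted context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)
```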

Before modern cloud-based architectures enabled enterprises to store massive amounts of data and data was kept in on-premises databases, what went into maintaining data quality?

Krishnamoorthy: Enterprises had extract, transform and load [ETL] tools and data quality tools. Data quality was viewed as a [filter] that you were applying as part of preparing the data from operational systems to move to analytics systems. There were ETL jobs that were just moving the data around, and it was important for data engineers or ETL developers to look at the intended usage and the source input and then apply these rules that were opaque.

They were difficult to understand, troubleshoot and work through. It was a much more manual process.
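For context, the kind of rule-based filter he describes might look like the following sketch, written here with pandas; the table, column names and rules are hypothetical. In a legacy pipeline, this logic typically lived inside an ETL job where downstream users could not see which records had been dropped or why.

```python
# Sketch of a traditional ETL-style data quality filter: rules applied while
# moving records from an operational system to an analytics system.
# The column names and rules are hypothetical examples.
import pandas as pd

def apply_quality_rules(orders: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows that satisfy hard-coded quality rules."""
    clean = orders.dropna(subset=["order_id", "customer_id"])  # required fields present
    clean = clean[clean["amount"] > 0]                         # no zero or negative amounts
    clean = clean.drop_duplicates(subset=["order_id"])         # one row per order
    return clean
```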

As modern cloud-based architectures became ubiquitous and data volumes grew, what technologies emerged to address data quality?

Krishnamoorthy: There are technology capabilities, but there are also processes.

Technologically, there are now zero ETL tools where you can set up the system and tell it to make certain data available for analytics. You can do that in [database systems] in a columnar format so it's transparent and not opaque -- not running in some tool or an ETL job on the side -- and you can configure it.

There are also data catalogs where you can understand how data is flowing, see what rules have been applied, when the rules were applied and whether the data is fresh. A person who is looking at a dashboard has the ability to interrogate the input data coming into the dashboard, which is made easier by the catalog.
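A catalog-style interrogation of freshness and structure might look like this sketch, which uses pyiceberg against a hypothetical REST catalog; the endpoint and table name are assumptions for illustration, not a specific AWS service.

```python
# Sketch of interrogating a table through a catalog: when was the data last
# updated, and what does it look like? The catalog URI and table name are
# hypothetical.
from datetime import datetime, timezone
from pyiceberg.catalog import load_catalog

catalog = load_catalog("analytics", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("sales.orders")

snapshot = table.current_snapshot()
if snapshot is not None:
    freshness = datetime.fromtimestamp(snapshot.timestamp_ms / 1000, tz=timezone.utc)
    print(f"Last updated: {freshness.isoformat()}")

print(table.schema())  # the columns and types a dashboard user could inspect
```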

What processes have changed to better address data quality?

Krishnamoorthy: Ownership of data has changed. Rather than [centralized teams controlling] the data before making it available, application owners can now make data available for analytics. Ownership is shifting to the people building the applications, which makes the data more available for analytics and AI. [Data management tools are making] it effortless for application developers to take on that ownership, which simplifies the management of the data.

Looking ahead, do you expect new tools that address data quality to emerge?

Krishnamoorthy: If I look at data quality, there is normalization and enrichment, and there is cleansing.

Normalization and enrichment are about trying to reshape the data based on how it is being used. There is a level of automation now possible because of catalogs that can look at how data is used from end to end, so normalization and enrichment will be automated much more than they are today. We're now seeing meaningful first steps being taken. For example, when an application changes its schema to add some feature, we can automatically propagate that into an analytics system so that an application change no longer breaks the analytics side. This has already been launched. The industry term is active metadata.
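The automated propagation he describes is a managed capability, but the underlying mechanics can be sketched with Apache Iceberg schema evolution via pyiceberg; the catalog, table and column names below are hypothetical.

```python
# Rough illustration of propagating an application schema change into an
# analytics table using Iceberg schema evolution. In the managed "active
# metadata" scenario described above, this step happens automatically.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("analytics", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("sales.orders")

# The application added a "loyalty_tier" field; add the matching column
# downstream so analytics queries keep working instead of breaking.
with table.update_schema() as update:
    update.add_column("loyalty_tier", StringType())
```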


Cleansing is more fussy. It's not strictly rules-based. But it's where generative AI can really help by taking capabilities such as matching that make data trustworthy and making them more powerful.
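A sketch of what generative-AI-assisted matching could look like follows; the model call is a placeholder, and the record fields are invented for illustration.

```python
# Sketch of generative-AI-assisted cleansing: ask a model whether two customer
# records refer to the same entity instead of hand-writing brittle match rules.
# call_llm() is a placeholder for whatever model endpoint you use.
import json

def call_llm(prompt: str) -> str:
    """Placeholder: return the model's text completion for the prompt."""
    raise NotImplementedError

def records_match(a: dict, b: dict) -> bool:
    prompt = (
        "Do these two customer records describe the same person? "
        "Answer YES or NO.\n"
        f"Record A: {json.dumps(a)}\nRecord B: {json.dumps(b)}"
    )
    return call_llm(prompt).strip().upper().startswith("YES")

# Example pair a rules engine would likely miss but a model can reconcile:
# records_match({"name": "Jon Smith", "city": "NYC"},
#               {"name": "Jonathan Smith", "city": "New York"})
```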

With AI development now a focal point for many enterprises, what are the components that make up a modern data architecture and help ensure that high-quality data is used to inform AI applications?

Krishnamoorthy: It is really important for organizations to build a shared data foundation, one that enables not only generative AI and agentic AI use cases but also supports your other data needs. You don't want to have a stack for your AI users and a separate stack for other users. That adds a tremendous amount of problems. Here are the components:

  • Have a set of online databases that meet your application needs, including the model you are working on and the performance you need. Being comprehensive here is really important.
  • An easy way of bringing data into your analytics and AI environment. Here, an open format like Apache Iceberg provides a standard pattern, so you can make sure that all your data can be available to the different environments (see the sketch after this list).
  • The ability to discover and understand your data. This is where a catalog comes in.
  • An environment where all your data can be used to innovate with a variety of tools … such as SageMaker AI for building machine learning models, Bedrock for building GenAI applications, Databricks, Snowflake, Dremio -- a shared data foundation so they can all operate on the same underlying data.
  • As new agentic capabilities are developed and deployed, make sure that these capabilities are available [to the agents]. Model Context Protocol has emerged as the dominant way LLMs interact with your data. It's important to provide what is needed for a use case as part of the systems that have been built, rather than having to integrate new systems into your overall architecture, which adds complexity and cost.
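To make the "write once, query anywhere" idea behind a shared data foundation concrete, the sketch below appends rows to an open Apache Iceberg table with pyiceberg, after which any engine that reads Iceberg -- Databricks, Snowflake, Dremio and others -- can query the same underlying data. The catalog settings and table name are assumptions for illustration.

```python
# Minimal sketch of landing data in an open Iceberg table so multiple engines
# can operate on the same underlying files. Catalog settings are hypothetical.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("analytics", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("sales.orders")  # hypothetical existing Iceberg table

new_rows = pa.table({
    "order_id": [1001, 1002],
    "customer_id": ["c-17", "c-42"],
    "amount": [129.99, 54.50],
})
table.append(new_rows)  # every downstream engine now sees the new rows
```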

If that infrastructure is straightforward and well understood, what is preventing every business from implementing something similar?

Krishnamoorthy: There are a good number of our customers who have made the transition, but it's less about a specific technology than buying into an architecture that includes the notion of zero ETL and a shared data foundation.

When we work with customers, we often find a handful of barriers that we work with them to resolve. One subset has to do with the shift from the way they operate today to this shared data foundation. They have to organize change management so the existing silos can be transformed. A federated approach can help: everything doesn't have to be consolidated and can instead be federated [across organizational domains], with each domain continuing to own its data.

Another thing that's important to note is that no one gets to start fresh. Maybe they have existing Hadoop-based data lakes that enabled a lot of use cases, but the data is not usable now, so there's a transition that has to be made from a Hadoop-style database to Iceberg-style lakehouses. Making that technical transition is a barrier.

And a third thing is that customers have lots of individual systems, which creates a lot of complexity with integrations.

Are there tools that can help organizations overcome some of the barriers so that even if they don't completely overhaul their data infrastructure, they can still build AI tools?

Krishnamoorthy: [A catalog such as] SageMaker Catalog can work with all systems if the data is open with Apache Iceberg. Such catalogs provide the key components enterprises need to build their overall stacks -- the user interface, the individual processing capabilities, the open capabilities and a database with zero ETL. That reduces the integration burden in making the transition.

Also, the cloud itself is a huge enabler because it removes the scale and cost barriers involved with setting up something enterprise-wide.

What are the costs for enterprises that don't have the data infrastructure you outlined, and the benefits for those that do have modern data architectures?

Krishnamoorthy: The ability to innovate on all your data helps you understand your business. The ones who can understand their customers better and can understand their opportunities will deliver better products at better economics and have more personalized [interactions] with their customers. It's a sustainable advantage for you to leverage these tools to serve your customers and optimize your business.

GenAI is a more powerful tool to do so, but you could have said the same thing [about data analysis] in previous generations. The really important thing is building the architecture for your data, the shared data foundation, [which] not only serves your GenAI tools but also serves your other needs. It helps you do market segmentation better [and] helps your marketing manager with the best information about your customers in your BI tool, like Tableau or QuickSight.

Beyond making it easier to manage data and develop AI and analytics tools, what is an advantage of a modern data architecture?

Krishnamoorthy: The opportunity to take advantage of data is tremendous, but one of the limitations is people's time itself.

Making the foundation more effortless so people can spend less time managing data and more time unlocking value is going to be a huge differentiator. A big focus for us is how to make the infrastructure effortless. The time of the people who understand your business is a limiter, so [it's important to maximize] their time to create value for the business and for customers by leveraging a data infrastructure that makes it effortless to manage data and databases.

Eric Avidon is a senior news writer for Informa TechTarget and a journalist with more than 25 years of experience. He covers analytics and data management.
