The tools of the trade for data professionals have undergone significant revision in recent years. Cloud workloads and data lakes, particularly, have challenged traditional models of information architecture.
In this Q&A, the first of two parts from an interview with data management expert William McKnight, the president of McKnight Consulting Group looks at some of the ramifications.
One of the first targets for data on the cloud seemed to be data warehouses. Amazon's Redshift broke the ground for that. Isn't the data warehouse the first step in tackling cloud workloads for a lot of shops?
McKnight: I'd agree. But it is not exclusive to the data warehouse by any means. It's all over the place. Arguably, Salesforce has been a strong first port to the cloud, though it is not your database, really -- it's their database.
But, nevertheless, it is a sure sign of moving to the cloud. There are a lot of operational databases. But, in so many cases, the data warehouse is a strong candidate for the thing to move next -- that is, no matter where organizations are with their overall rollout to the cloud. A lot of companies are under cloud mandates, so there are a lot of things moving at once.
You've talked in the past about effective information architecture. How do you see it done today, when people have cloud applications to feed, data lakes on their premises and so on?
McKnight: The big trend has only grown -- that is, there are a lot of different data stores out there that are relevant to an enterprise.
A big key in the success overall of your data program is matching the workload to the right data platform, and, today, you have a lot of options. It's easy to get it wrong, and it can be challenging to get it right. So, it's important to be on top of the different possibilities and not to just keep reaching for the same hammer every time you have a new workload.
The new workloads are coming fast and furiously as companies realize data is what sets them apart, and they want to capitalize on it. What project doesn't need a lot of good, quality data?
The cloud is ready for data. Anything can be put there; you can treat it like a data center if you want. But then there are databases that have been built specially for the cloud or that have been re-engineered to work with the cloud. They are gaining elasticity, separation of compute and storage, rich SQL, chargeback, hands-off service -- the things that you would expect from databases that are going to give you scale. People will want to go there with both their analytics and with their operational workloads.
Let's look at another aspect of modern information architecture. Data preparation in the face of the data lake seems to be undergoing some changes -- is that fair to say?
McKnight: The data lake has had low levels of curation -- I'd say historically, but there is not a lot of history there. That's just due to the high-velocity nature of the data and the fact that you're probably not going to use the data for reports or things like that. It's more about diamonds in the rough that you're going to be looking at.
But it really does behoove the companies that are putting data in there to be sure the data is fit for purpose. Now, that may not mean 100% data quality of the kind we strive for in the data warehouses we use for bet-your-business reporting. Nonetheless, a data quality program over the data lake is important. Organizations should at least know the quality of the data that they are putting in there. Furthermore, if that data is moving on to other places where it is going to have a high calling to the organization, all the more reason to get it right as it moves in.
That makes it an ongoing program. You have to constantly raise the bar on the quality of the data. You have to attend to it. But keep in mind that the data scientists are possibly going to be working on that data, too. We find, sometimes, there is a divide between the data scientist and the architect of the data lake. If the right hand doesn't know what the left hand is doing, of course that should be remedied.
I have become a strong believer in data lakes. They are a staging ground for the data warehouse and -- probably more important down the road -- they are a data bed for data science in the organization.
Data lakes are a place where people can set themselves apart and, increasingly, that is the place where artificial intelligence will get its data from. That is the place for the organization to exercise its algorithms -- to come up with things you are just not going to come up with otherwise -- and those are possibly real competitive advantages.
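Editor's note: McKnight's point that organizations "should at least know the quality of the data that they are putting in there" can be made concrete with a small profiling step run on each batch before it lands in the lake. The following is a minimal sketch, not a method described in the interview. The function name `profile_quality`, the field names and the sample batch are all hypothetical, and a real program would likely track many more measures than these two.

```python
from collections import Counter


def profile_quality(records, required_fields, key_field):
    """Return simple quality metrics for a batch of records (a list of dicts).

    Hypothetical helper for illustration: it measures completeness per
    required field and counts repeated values of the key field.
    """
    total = len(records)
    if total == 0:
        return {"rows": 0, "completeness": {}, "duplicate_keys": 0}

    # Completeness: share of rows where the field is present and non-empty.
    completeness = {}
    for field in required_fields:
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        completeness[field] = present / total

    # Duplicates: every occurrence of a key beyond its first one.
    key_counts = Counter(
        r.get(key_field) for r in records if r.get(key_field) is not None
    )
    duplicate_keys = sum(n - 1 for n in key_counts.values() if n > 1)

    return {
        "rows": total,
        "completeness": completeness,
        "duplicate_keys": duplicate_keys,
    }


# Made-up sample batch, purely for illustration.
batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 2, "email": "c@example.com"},
    {"id": 3, "email": None},
]

report = profile_quality(batch, required_fields=["id", "email"], key_field="id")
print(report)
```

The report does not block anything from entering the lake. It only records what is known about a batch, which fits the idea of measuring quality first and raising the bar over time.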