Considerations for planning AI storage architecture for big data
This chapter from 'The Artificial Intelligence Infrastructure Workshop' examines how to plan for AI data storage. Plan for factors like volume and scale for long-term success.
The creation and processing of large amounts of data -- coupled with expanding innovations, such as artificial intelligence and machine learning -- present many opportunities for organizations to make better use of their data and find trends to inform business decisions. But there are also many challenges with AI storage architecture, including planning for large data stores and heavy compute needs.
Planning storage for AI can be a difficult task. Chinmay Arankalle, senior data engineer at Energy Exemplar and co-author of The Artificial Intelligence Infrastructure Workshop, hopes the book will bring clarity to how organizations should implement AI and the storage behind it.
When designing an AI system, the authors wrote, organizations should consider storage requirements upfront to fit the type of analysis they intend to perform. AI storage systems typically need high performance, high scalability and large capacity.
Planning for raw data storage is key
Uses for AI models include analyzing buyer personas, consumer behaviors and traffic patterns. To properly train AI models, organizations must store a lot of data. For AI projects, the authors wrote, raw data is often best stored in its entirety.
"In modern AI projects, what we see is that it's best to keep the raw data for as long as possible since (file) storage is cheap, scalable and available in the cloud," the authors wrote. "Moreover, the raw data often provides the best source for model training in a machine learning environment, so it's valuable to give data scientists access to many historical data files."
When storing large data sets, have a plan for storage requirements that meets the needs of the project. The authors recommended storage such as Amazon S3 or Azure Data Lake Storage for large data stores.
Be mindful of the long run
As technology evolves, data formats and hardware change. If an organization does not know how it may use its data in the future, that data can be difficult to manage. For example, organizations will have a hard time choosing which data to delete to save storage space, aside from data that must be deleted due to regulations.
As organizations store more data, their AI storage architecture can evolve to meet growing performance and scalability needs.
"For example, if we are getting data in any unstructured format, then it could be used for data science, or it could be used for some reporting purpose. So, the end goal is not fixed, usually," Arankalle said in a recent interview with TechTarget. "So, maybe now, the data we have stored has some use, [but] after 10 years, the data might have some different use altogether."
Because organizations can use older data to inform newer AI models, it is also important to ensure that older data remains compatible with newer data.
"The challenge ahead of us is how we can utilize the older formats along with the newer ones," Arankalle said. "And, since we can't suddenly get rid of the old data, we need to make sure there is some harmony between older data and the newer data."
To ensure consistency in data formats, regularly revisit formatting and storage requirements throughout the AI project.
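One way to keep that "harmony between older data and the newer data" is a thin normalization layer that maps each format onto a single shared schema before analysis. The Python sketch below assumes a hypothetical case where legacy data arrived as CSV and newer data arrives as JSON Lines; the field names (uid, amt, user_id, amount) are invented for illustration.

```python
import csv
import io
import json


def from_legacy_csv(text: str):
    """Normalize records from the older CSV format (columns: uid, amt)."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"user_id": row["uid"], "amount": float(row["amt"])}


def from_json_lines(text: str):
    """Normalize records from the newer JSON Lines format."""
    for line in text.splitlines():
        if line.strip():
            rec = json.loads(line)
            yield {"user_id": rec["user_id"], "amount": float(rec["amount"])}


# Both sources end up with identical record shapes, so downstream
# training or reporting code never sees the format difference.
old_data = "uid,amt\nu1,9.50\n"
new_data = '{"user_id": "u2", "amount": 12.0}\n'
records = list(from_legacy_csv(old_data)) + list(from_json_lines(new_data))
```

When a format changes again, only a new `from_...` adapter is added; the shared schema, and everything built on it, stays stable.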
"Requirements management is an ongoing process in an AI project," the authors wrote. "Rather than setting all the requirements in stone at the start of the project, architects and developers should be agile, revisiting and revising the requirements after every iteration."
The rising role of data lakes
Data lakes are becoming a more popular way to store and manage large data sets. Some business intelligence tools benefit from data warehouses because warehouses hold data from multiple systems within an organization, enabling organizations to spot correlations between different metrics, such as CRM data and inventory levels.
Data warehouses can be more expensive than data lakes because they impose more structure on the data. However, organizations are rapidly generating unstructured data. Data lakes handle unstructured data better but can be harder to manage because the data arrives from more sources and in more formats.
In Chapter 2, "Artificial Intelligence Storage Requirements," the authors dived into the role of data lakes and AI storage. The chapter also covers tips on how to plan and manage storage for AI projects, considerations such as security and availability, and data layers.