Organizations are implementing AI projects for numerous applications in a wide range of industries. These applications include predictive analytics, pattern recognition systems, autonomous systems, conversational systems, hyper-personalization and goal-driven systems. Each of these projects has something in common: They're all predicated on understanding the business problem and applying data and machine learning algorithms to it, resulting in a machine learning model that addresses the project's needs.
Deploying and managing machine learning projects typically follows the same pattern. However, existing app development methodologies don't apply because AI projects are driven by data, not programming code: the learning is derived from the data. The right machine learning approach and methodology stem from these data-centric needs and result in projects that work through the stages of data discovery, cleansing, training, model building and iteration.
7 steps to building a machine learning model
For many organizations, machine learning model development is a new activity and can seem intimidating. Even for those with experience in machine learning, building an AI model requires diligence, experimentation and creativity. The methodology for building data-centric projects, however, is somewhat established. The following steps will help guide your project.
Step 1. Understand the business problem (and define success)
The first phase of any machine learning project is developing an understanding of the business requirements. You need to know what problem you're trying to solve before attempting to solve it.
To start, work with the owner of the project and make sure you understand the project's objectives and requirements. The goal is to convert this knowledge into a suitable problem definition for the machine learning project and devise a preliminary plan for achieving the project's objectives. Key questions to answer include the following:
- What's the business objective that requires a cognitive solution?
- What parts of the solution are cognitive, and what aren't?
- Have all the necessary technical, business and deployment issues been addressed?
- What are the defined "success" criteria for the project?
- How can the project be staged in iterative sprints?
- Are there any special requirements for transparency, explainability or bias reduction?
- What are the ethical considerations?
- What are the acceptable parameters for accuracy, precision and confusion matrix values?
- What are the expected inputs to the model and the expected outputs?
- What are the characteristics of the problem being solved? Is this a classification, regression or clustering problem?
- What is the "heuristic" -- the quick-and-dirty approach to solving the problem that doesn't require machine learning? How much better than the heuristic does the model need to be?
- How will the benefits of the model be measured?
Although there are a lot of questions to be answered during the first step, answering or even attempting to answer them will greatly increase the chances of overall project success.
Setting specific, quantifiable goals will help realize measurable ROI from the machine learning project instead of simply implementing it as a proof of concept that'll be tossed aside later. The goals should be related to the business objectives and not just to machine learning. While machine learning-specific measures -- such as precision, accuracy, recall and mean squared error -- can be included in the metrics, more specific, business-relevant key performance indicators (KPIs) are better.
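One concrete way to answer the "heuristic" question above is to score a trivial baseline alongside a candidate model before committing to the project. This is a minimal sketch, assuming scikit-learn and a synthetic data set standing in for real project data; `DummyClassifier` plays the role of the quick-and-dirty heuristic.

```python
# Compare a quick-and-dirty heuristic against a candidate model.
# Assumes scikit-learn; the synthetic data is an illustrative stand-in.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Heuristic: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# The candidate model has to beat the heuristic to justify the project.
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
model_acc = model.score(X_test, y_test)

print(f"baseline={baseline_acc:.2f} model={model_acc:.2f}")
```

If the gap between the two numbers is small, that's a signal to revisit the business case before investing further.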
Step 2. Understand and identify data
Once you have a firm understanding of the business requirements and receive approval for the plan, you can start to build a machine learning model, right? Wrong. Establishing the business case doesn't mean you have the data needed to create the machine learning model.
A machine learning model is built by learning and generalizing from training data, then applying that acquired knowledge to new data it has never seen before to make predictions and fulfill its purpose. Lack of data will prevent you from building the model, and access to data alone isn't enough. Useful data needs to be clean and in good shape.
Identify your data needs and determine whether the data is in proper shape for the machine learning project. The focus should be on data identification, initial collection, requirements, quality identification, insights and potentially interesting aspects that are worth further investigation. Here are some key questions to consider:
- Where are the sources of the data that's needed for training the model?
- What quantity of data is needed for the machine learning project?
- What is the current quantity and quality of training data?
- How are the test set data and training set data being split?
- For supervised learning tasks, is there a way to label that data?
- Can pre-trained models be used?
- Where is the operational and training data located?
- Are there special needs for accessing real-time data on edge devices or in more difficult-to-reach places?
Answering these important questions helps you get a handle on the quantity and quality of data as well as understand the type of data that's needed to make the model work.
In addition, you need to know how the model will operate on real-world data. For example, will the model be used offline, operate in batch mode on data that's fed in and processed asynchronously, or be used in real time, operating with high-performance requirements to provide instant results? This information will also determine the sort of data needed and data access requirements.
Also determine whether the model will be trained once, retrained in iterations with new versions deployed periodically, or trained in real time. Real-time training imposes many requirements on data that might not be feasible for some setups.
During this phase of the AI project, it's also important to know if any differences exist between real-world data and training data as well as test data and training data, and what approach you will take to validate and evaluate the model for performance.
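The train/test split question above is usually answered by carving out three sets before any modeling begins. A minimal sketch, assuming scikit-learn; the 70/15/15 proportions and placeholder data are illustrative choices, not a prescription:

```python
# Split data into training, validation and test sets.
# Assumes scikit-learn; sizes and data are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)           # placeholder features
y = np.random.RandomState(0).randint(0, 2, size=1000)  # placeholder labels

# First carve off a 150-row test set, then split the remainder again.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=150, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=150, random_state=42, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying on the labels keeps the class balance consistent across all three sets, which matters when validating against real-world data distributions.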
Step 3. Collect and prepare data
Once you've appropriately identified your data, you need to shape that data so it can be used to train your model. The focus is on data-centric activities necessary to construct the data set to be used for modeling operations. Data preparation tasks include data collection, cleansing, aggregation, augmentation, labeling, normalization and transformation as well as any other activities for structured, unstructured and semi-structured data.
Procedures during the data preparation, collection and cleansing process include the following:
- Collect data from the various sources.
- Standardize formats across different data sources.
- Replace incorrect data.
- Enhance and augment data.
- Add more dimensions with pre-calculated amounts and aggregate information as needed.
- Enhance data with third-party data.
- "Multiply" image-based data sets if they aren't sufficient enough for training.
- Remove extraneous information and duplicates.
- Remove irrelevant data from training to improve results.
- Reduce noise and remove ambiguity.
- Consider anonymizing data.
- Normalize or standardize data to get it into formatted ranges.
- Sample data from large data sets.
- Select features that identify the most important dimensions and, if necessary, reduce dimensions using a variety of techniques.
- Split data into training, test and validation sets.
Data preparation and cleansing tasks can take a substantial amount of time. Surveys of machine learning developers and data scientists show that the data collection and preparation steps can take up to 80% of a machine learning project's time. As the saying goes, "garbage in, garbage out." Since machine learning models need to learn from data, the amount of time spent on prepping and cleansing is well worth it.
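Several of the preparation tasks above -- deduplication, replacing missing values and normalization -- can be sketched with pandas. The column names and cleaning rules here are illustrative assumptions, not part of any particular project:

```python
# A minimal data-preparation sketch: dedupe, fill missing values,
# then min-max normalize. Assumes pandas; columns are illustrative.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["a", "a", "b", "c", "d"],
    "age": [34, 34, None, 29, 41],
    "spend": [120.0, 120.0, 80.0, None, 200.0],
})

clean = raw.drop_duplicates()                 # remove duplicate rows
clean = clean.fillna({"age": clean["age"].median(),
                      "spend": clean["spend"].median()})  # replace missing values

# Normalize numeric columns into the 0-1 range.
for col in ["age", "spend"]:
    clean[col] = (clean[col] - clean[col].min()) / (clean[col].max() - clean[col].min())
```

Real pipelines would also log every transformation so the same steps can be replayed on production data.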
Step 4. Determine the model's features and train it
Once the data is in usable shape and you know the problem you're trying to solve, it's finally time to move to the step you've been waiting for: training the model to learn from the good-quality data you've prepared by applying a range of techniques and algorithms.
This phase requires model technique selection and application, model training, model hyperparameter setting and adjustment, model validation, ensemble model development and testing, algorithm selection, and model optimization. To accomplish all that, the following actions are required:
- Select the right algorithm based on the learning objective and data requirements.
- Configure and tune hyperparameters for optimal performance and determine a method of iteration to attain the best hyperparameters.
- Identify the features that provide the best results.
- Determine whether model explainability or interpretability is required.
- Develop ensemble models for improved performance.
- Test different model versions for performance.
- Identify requirements for the model's operation and deployment.
The resulting model can then be evaluated to determine whether it meets the business and operational requirements.
Step 5. Evaluate the model's performance and establish benchmarks
From an AI perspective, evaluation includes model metric evaluation, confusion matrix calculations, KPIs, model performance metrics, model quality measurements and a final determination of whether the model can meet the established business goals. During the model evaluation process, you should do the following:
- Evaluate the models using a validation data set.
- Determine confusion matrix values for classification problems.
- Identify methods for k-fold cross-validation if that approach is used.
- Further tune hyperparameters for optimal performance.
- Compare the machine learning model to the baseline model or heuristic.
Model evaluation can be considered the quality assurance of machine learning. Adequately evaluating model performance against metrics and requirements determines how the model will work in the real world.
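Several of the evaluation tasks above -- validation-set scoring, confusion matrix calculation and k-fold cross-validation -- can be sketched as follows, assuming scikit-learn and a built-in data set as a stand-in for project data:

```python
# Evaluate a classifier on a held-out validation set.
# Assumes scikit-learn; model and data are illustrative stand-ins.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Confusion matrix: rows are true classes, columns are predictions.
cm = confusion_matrix(y_val, model.predict(X_val))

# 5-fold cross-validation gives a more stable accuracy estimate
# than a single validation split.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
```

The confusion matrix is what turns a single accuracy number into the per-class error picture the business-goal comparison needs.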
Step 6. Put the model in operation and make sure it works well
When you're confident that the machine learning model can work in the real world, it's time to see how it actually operates in the real world -- also known as "operationalizing" the model:
- Deploy the model with a means to continually measure and monitor its performance.
- Develop a baseline or benchmark against which future iterations of the model can be measured.
- Continuously iterate on different aspects of the model to improve overall performance.
Model operationalization might include deployment scenarios in a cloud environment, at the edge, in an on-premises or closed environment, or within a closed, controlled group. Among operationalization considerations are model versioning and iteration, model deployment, model monitoring and model staging in development and production environments. Depending on the requirements, model operationalization can range from simply generating a report to a more complex, multi-endpoint deployment.
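The versioning and monitoring considerations above can be sketched in miniature: package the trained model with a version tag and wrap it in a service that counts the predictions it serves. This assumes a simple in-process design using `pickle`; a production deployment would use a model registry and a real monitoring system instead:

```python
# A toy operationalization sketch: a versioned model artifact plus a
# serving wrapper with a basic usage counter. Assumes scikit-learn and
# pickle; real systems would use a registry and monitoring stack.
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
artifact = {"version": "1.0.0",
            "model": LogisticRegression(max_iter=1000).fit(X, y)}

blob = pickle.dumps(artifact)  # stand-in for writing to a model registry

class ModelService:
    """Serves predictions and tracks how many it has made."""

    def __init__(self, blob):
        loaded = pickle.loads(blob)
        self.version = loaded["version"]
        self.model = loaded["model"]
        self.predictions_served = 0

    def predict(self, rows):
        self.predictions_served += len(rows)
        return self.model.predict(rows)

service = ModelService(blob)
preds = service.predict(X[:5])
```

Keeping the version tag inside the artifact means every prediction can be traced back to the exact model iteration that produced it.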
Step 7. Iterate and adjust the model
Even though the model is operational and you're continuously monitoring its performance, you're not done. When it comes to implementing technologies, it's often said that the formula for success is to start small, think big and iterate often.
Always repeat the process and make improvements in time for the next iteration. Business requirements change. Technology capabilities change. Real-world data changes in unexpected ways. Any of these changes might create new requirements for deploying the model onto different endpoints or in new systems. The end may just be a new beginning, so it's best to determine the following:
- the next requirements for the model's functionality;
- expansion of model training to encompass greater capabilities;
- improvements in model performance and accuracy;
- improvements in model operational performance;
- operational requirements for different deployments; and
- solutions to "model drift" or "data drift," which can cause changes in performance due to changes in real-world data.
Reflect on what has worked in your model, what needs work and what's a work in progress. The surefire way to achieve success in machine learning model building is to continuously look for improvements and better ways to meet evolving business requirements.
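The data-drift problem mentioned above can be detected, in its simplest form, by comparing a feature's training distribution against recent live data. A minimal sketch assuming SciPy's two-sample Kolmogorov-Smirnov test; the 0.05 threshold and the simulated "shifted" live data are illustrative choices:

```python
# Flag data drift by comparing training and live feature distributions
# with a two-sample Kolmogorov-Smirnov test. Assumes SciPy; the data
# and 0.05 threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.5, scale=1.0, size=5000)  # shifted: drift

stat, p_value = ks_2samp(train_feature, live_feature)
drift_detected = p_value < 0.05

print(f"KS statistic={stat:.3f}, drift={drift_detected}")
```

A drift alert like this is typically the trigger for the retraining loop described in this step.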
Historical perspective on model building
About 25 years ago, a consortium of five vendors developed the Cross-Industry Standard Process for Data Mining (CRISP-DM), which focused on a continuous iteration approach to the various data-intensive steps in a data mining project. The methodology starts with an iterative loop between business understanding and data understanding. That's followed by a handoff to an iterative loop between data preparation and data modeling, then by an evaluation phase whose results either lead to deployment or feed back into business understanding. This cyclic, iterative loop leads to continuous data modeling, preparation and evaluation.
But further development of CRISP-DM seems to have stalled at version 1.0, which was finalized almost two decades ago, with only rumors of a second version under way nearly 15 years ago. IBM and Microsoft have iterated on the methodology to produce their own variants that add more detail to the iterative loops between data processing and modeling, along with more specifics on the artifacts and deliverables produced during the process.
In addition, the methodology was criticized for not being particularly agile or specific to AI and machine learning projects. Methodologies such as Cognitive Project Management for AI were developed to meet AI-specific requirements, and they can be implemented in organizations with existing Agile development teams and data organizations. Those methodologies, as well as lessons learned from large companies and their data science teams, have resulted in a stronger, more flexible step-by-step approach to machine learning model development that meets the specific needs of cognitive projects.