Date: 2024-01-05

The idea of a "pipeline" is hardly new. Defining and implementing a well-understood workflow has long been key to efficient and cost-effective physical manufacturing -- just look at any factory floor.

When software development emerged and productized the intellectual property of modern businesses, the idea of a pipeline became a foundation of development efforts and effective software project management. And as machine learning and AI gain traction, the ideas and benefits of development pipelines can be directly applied to ML model creation and deployment. But it's important to understand the role, benefits and features of pipelines in ML projects.

What is a machine learning pipeline?

An ML pipeline is the end-to-end process used to create, train and deploy an ML model. Pipelines help ensure that each ML project is approached in a similar manner, enabling business leaders, developers, data scientists and operations staff to participate in the final ML product. When implemented properly, an ML pipeline establishes cross-disciplinary collaboration among developers, data scientists and operations staff. The similarities of ML pipelines to other software development pipelines are so strong that the label MLOps has arisen for staff specializing in ML operations, drawing on the DevOps moniker.

Although the actual steps and complexity of an ML pipeline can vary dramatically based on a business's needs and goals, ML pipelines typically involve the following broad phases:

Definition of business goals and objectives.

Data selection and processing.

Model creation or selection and training.

Model deployment and management.

Benefits of a machine learning pipeline

Establishing and refining an ML pipeline brings many of the same benefits that organizations depend on for other software development environments. These include speed, uniformity, common understanding, automation and orchestration.

Accelerated ML model deployment

There are many potential avenues for ML model development, training, testing and deployment. But when different ML teams or projects within the same organization take different approaches, the variations in processes and procedures can result in unwanted delays and unexpected problems. Creating a uniform ML pipeline avoids that potential chaos. With an established pipeline, each ML project is approached the same way, using the same tools and processes. This means that a properly implemented and well-considered ML pipeline can accelerate projects, bringing an ML model from inception to deployment much faster than free-form approaches.

Favors reusable components

Establishing a pipeline in an ML project enables organizations to break down processes and workflows into clear and purposeful steps, such as model testing. Data and data sets are also important reusable components in ML projects. Embracing the idea of reusability lets model developers create steps and data sets that can be selected and connected with orchestration and automation technologies to streamline the pipeline. This helps to both accelerate the project and reduce human error, which can lead to oversights and security problems. In addition, workflow steps in the form of reusable components can allow for variations in the overall pipeline -- such as skipping or adding certain steps as needed -- without requiring staff to create and support new pipelines for different purposes.

Easier model troubleshooting

ML projects are fundamentally software development efforts that blend data and data science into the project lifecycle. Consequently, like anything else in software development, ML projects entail inevitable design, coding, testing and analytical challenges that require troubleshooting and remediation. Establishing a common ML pipeline with clear steps and reusable components helps ensure that each ML project follows a similar well-defined pattern. This makes ML projects easier to understand, evaluate and troubleshoot.

Better regulatory oversight

The enterprise is ultimately responsible for the ML model that is deployed and the outcomes it provides. As ML models are increasingly productized, they are also increasingly scrutinized for factors such as the following:

Model performance -- for example, speed of response to user queries.

Bias in how the model is trained and makes decisions.

Data security, including user preferences and other potentially sensitive information.

A pipeline enables an organization to codify and document the processes and elements of its ML projects. This encompasses data components, selection criteria, testing regimens, training techniques, monitoring and observability metrics, and KPIs. In addition, each component of the pipeline can be independently documented and evaluated for security, compliance, business continuity and alignment with business objectives -- in other words, whether the ML project is driving the intended results for the business.

Blends DevOps and MLOps

Machine learning projects often involve considerable software development investment, particularly when creating or modifying models and algorithms. DevOps has long been a leading methodology for software projects, combining development with common operations tasks such as deployment and monitoring. As ML projects gain traction in the enterprise, they're increasingly handled by MLOps teams responsible for ML development, deployment and operations. In this context, pipelines play a crucial role in providing control and governance for both ML projects and MLOps environments. MLOps is often viewed as an extension or subcategory of DevOps, with a specialized focus on machine learning models and AI applications.

Basic steps of a machine learning pipeline

There is no single way to build an ML pipeline, and the details can vary dramatically based on business size and industry requirements. However, most ML pipelines can be divided into four major phases: business goal setting, data preparation, model preparation and model deployment.

1. Business goal setting

Any ML project workflow should start with a careful consideration of business needs and goals. Business and technology leaders as well as ML project stakeholders should take the time to design an ML pipeline based on current and anticipated future requirements. Common considerations at this stage include the following:

Understand the purpose of the ML project and its expected outcomes for the business.

Understand the prevailing regulatory environment for ML models, data sets and ML usage.

Consider the relationship of ML to current and future business governance obligations. For example, could the business be adversely affected by choices such as which data is selected or which models are trained? Could the business be subject to litigation in the event of algorithmic bias or other undesirable outcomes for users?

Such issues should be addressed when developing the overall ML pipeline.

2. Data preparation

Data is the soul of ML and AI; without data, neither can exist. A major portion of any ML pipeline thus involves acquiring and preparing data used to train the ML model. This phase of the ML pipeline can include several substeps, such as the following:

Data selection. This step in the pipeline defines what data is used to train the model. Data can be selected or acquired from myriad sources, including user records, real-time devices such as IoT fleets, experimental test results such as scientific research and a wide range of third-party data sources. For example, if the business is developing an ML model to make purchase recommendations to online retail customers, the selected data will likely include user preferences and purchase history.
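As a rough illustration of the data selection step for the retail recommendation example, the sketch below keeps only the fields chosen for training and drops incomplete records. The field names (`user_id`, `preferences`, `purchase_history`), the `select_training_data` function and the sample records are all assumptions made up for this example; a real project would define its own selection criteria.

```python
from typing import Any, Dict, List

# Hypothetical field names for an online retail recommendation model.
SELECTED_FIELDS = ("user_id", "preferences", "purchase_history")


def select_training_data(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the selected fields, dropping records that lack any of them."""
    selected = []
    for record in records:
        if all(record.get(field) is not None for field in SELECTED_FIELDS):
            selected.append({field: record[field] for field in SELECTED_FIELDS})
    return selected


# Invented sample records, for illustration only.
raw_records = [
    {"user_id": 1, "preferences": ["outdoor"], "purchase_history": ["tent"], "email": "a@example.com"},
    {"user_id": 2, "preferences": None, "purchase_history": ["lamp"], "email": "b@example.com"},
]

training_rows = select_training_data(raw_records)
```

Writing the selection criteria down as an explicit list of fields also leaves unneeded sensitive fields, such as the email address here, out of the training set by default.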

