DevStack's ML Pipeline Cloud Service

How to build a solid MLOps process

Steps to build a solid MLOps process
The following steps will help you build a solid MLOps process.

The experimentation capability lets your data scientists and ML researchers collaboratively perform exploratory data analysis, create prototype model architectures, and implement training routines. An ML environment should also let them write modular, reusable, and testable source code that is version controlled. Key functionalities in experimentation include the following:

  • Provide notebook environments that are integrated with version control tools like Git.
  • Track experiments, including information about the data, hyperparameters, and evaluation metrics for reproducibility and comparison.
  • Analyze and visualize data and models.
  • Support exploring datasets, finding experiments, and reviewing implementations.
  • Integrate with other data services and ML services in your platform.

ML Pipeline Notebooks provides a way to run web-based development environments inside your Kubernetes cluster by running them inside Pods.
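
To illustrate the experiment-tracking functionality described above, here is a minimal sketch. The ExperimentTracker class, the file-based storage, and the dataset URI are hypothetical stand-ins; a managed tracking service would persist the same kind of record (data reference, hyperparameters, evaluation metrics) in its own backend.

    import json
    import time
    import uuid
    from pathlib import Path

    # Hypothetical minimal experiment tracker: records the data reference,
    # hyperparameters, and evaluation metrics so runs can be reproduced and compared.
    class ExperimentTracker:
        def __init__(self, root="experiments"):
            self.root = Path(root)
            self.root.mkdir(exist_ok=True)

        def log_run(self, dataset_uri, hyperparams, metrics):
            run = {
                "run_id": uuid.uuid4().hex,
                "timestamp": time.time(),
                "dataset_uri": dataset_uri,   # which data version was used
                "hyperparams": hyperparams,   # e.g. learning rate, batch size
                "metrics": metrics,           # e.g. validation accuracy
            }
            (self.root / f"{run['run_id']}.json").write_text(json.dumps(run, indent=2))
            return run["run_id"]

    tracker = ExperimentTracker()
    run_id = tracker.log_run(
        dataset_uri="s3://bucket/churn/v3",   # placeholder URI
        hyperparams={"learning_rate": 0.01, "epochs": 10},
        metrics={"val_accuracy": 0.91},
    )
    print("logged run", run_id)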

The data processing capability lets you prepare and transform large amounts of data for ML at scale in ML development, in continuous training pipelines, and in prediction serving. Key functionalities in data processing include the following:

  • Support interactive execution (for example, from notebooks) for quick experimentation and for long-running jobs in production.
  • Provide data connectors to a wide range of data sources and services, as well as data encoders and decoders for various data structures and formats.
  • Provide both rich and efficient data transformations and ML feature engineering for structured (tabular) and unstructured data (text, image, and so on).
  • Support scalable batch and stream data processing for ML training and serving workloads.
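
As a rough illustration of a reusable data transformation for tabular data, the sketch below uses pandas. The column names and the derived feature are made up for the example; the same function could be called interactively from a notebook or as a step in a production pipeline.

    import pandas as pd

    # Illustrative batch transform for tabular data: a derived ratio feature
    # plus one-hot encoding of a categorical column.
    def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out["spend_per_visit"] = out["total_spend"] / out["visits"].clip(lower=1)
        out = pd.get_dummies(out, columns=["plan"], prefix="plan")
        return out

    raw = pd.DataFrame({
        "total_spend": [120.0, 0.0, 330.5],
        "visits": [4, 0, 11],
        "plan": ["basic", "basic", "pro"],
    })
    print(engineer_features(raw))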

The model training capability lets you efficiently and cost-effectively run powerful algorithms for training ML models. Model training should be able to scale with the size of both the models and the datasets that are used for training. Key functionalities in model training include the following:

  • Support common ML frameworks and support custom runtime environments.
  • Support large-scale distributed training with different strategies for multiple GPUs and multiple workers.
  • Enable on-demand use of ML accelerators.
  • Allow efficient hyperparameter tuning and target optimization at scale.
  • Ideally, provide built-in automated ML (AutoML) functionality, including automated feature selection and engineering as well as automated model architecture search and selection.

AutoML automates the repetitive experimentation required to improve the predictive accuracy and performance of machine learning models.
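
The hyperparameter-tuning functionality can be sketched as a simple random search. The example below uses scikit-learn on synthetic data purely for illustration; in a production setup the individual trials would typically be distributed across workers or accelerators.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Toy random search over hyperparameters; each loop iteration is one trial.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    rng = np.random.default_rng(0)

    best_score, best_params = -np.inf, None
    for _ in range(10):                       # 10 trials for illustration
        params = {
            "n_estimators": int(rng.integers(50, 300)),
            "max_depth": int(rng.integers(2, 12)),
        }
        score = cross_val_score(
            RandomForestClassifier(**params, random_state=0), X, y, cv=3
        ).mean()
        if score > best_score:
            best_score, best_params = score, params

    print(f"best params: {best_params}, cv accuracy: {best_score:.3f}")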

The model evaluation capability lets you assess the effectiveness of your model, interactively during experimentation and automatically in production. Key functionalities in model evaluation include the following:

  • Perform batch scoring of your models on evaluation datasets at scale.
  • Compute pre-defined or custom evaluation metrics for your model on different slices of the data.
  • Track trained-model predictive performance across different continuous-training executions.
  • Visualize and compare performances of different models.
  • Provide tools for what-if analysis and for identifying bias and fairness issues.
  • Enable model behavior interpretation using various explainable AI techniques.
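
Slice-based evaluation can be illustrated with a small sketch. The age_group slices, labels, and predictions below are made up, and the same pattern extends to any pre-defined or custom metric.

    import pandas as pd
    from sklearn.metrics import accuracy_score

    # Compute a metric per data slice to spot segments where the model underperforms.
    eval_df = pd.DataFrame({
        "age_group":  ["18-25", "18-25", "26-40", "26-40", "41+", "41+"],
        "label":      [1, 0, 1, 1, 0, 0],
        "prediction": [1, 0, 0, 1, 0, 1],
    })

    per_slice = {
        name: accuracy_score(group["label"], group["prediction"])
        for name, group in eval_df.groupby("age_group")
    }
    print(per_slice)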

The model serving capability lets you deploy and serve your models in production environments. Key functionalities in model serving include the following:

  • Provide support for low-latency, near-real-time (online) prediction and high-throughput batch (offline) prediction.
  • Provide built-in support for common ML serving frameworks (for example, TensorFlow Serving, TorchServe, Nvidia Triton, and others for Scikit-learn and XGBoost models) and for custom runtime environments.
  • Enable composite prediction routines, where multiple models are invoked hierarchically or simultaneously before the results are aggregated, in addition to any required pre- or post-processing routines.
  • Allow efficient use of ML inference accelerators with autoscaling to match spiky workloads and to balance cost with latency.
  • Support model explainability using techniques like feature attributions for a given model prediction.
  • Support logging of prediction serving requests and responses for analysis.
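
A minimal online-prediction endpoint might look like the sketch below. FastAPI is used here only as one common choice for a custom serving runtime; the placeholder model, the request schema, and the module name (serve) are assumptions for the example, not the service's actual API.

    import logging
    from fastapi import FastAPI
    from pydantic import BaseModel

    logging.basicConfig(level=logging.INFO)
    app = FastAPI()

    class PredictRequest(BaseModel):
        features: list[float]

    # Placeholder model: a real deployment would load a trained artifact at startup.
    def model_predict(features: list[float]) -> float:
        return sum(features) / max(len(features), 1)

    @app.post("/predict")
    def predict(req: PredictRequest):
        score = model_predict(req.features)
        # Log request/response pairs so they can be analyzed later.
        logging.info("request=%s response=%s", req.features, score)
        return {"score": score}

    # Run with, for example: uvicorn serve:app --host 0.0.0.0 --port 8080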

The online experimentation capability lets you understand how newly trained models perform in production settings compared to the current models (if any) before you release the new model to production. For example, using a small subset of the serving population, you can use online experimentation to understand the impact that a new recommendation system has on click-throughs and on conversion rates. The results of online experimentation should be integrated with the model registry capability to facilitate the decision about releasing the model to production. Online experimentation enhances the reliability of your ML releases by helping you decide to discard ill-performing models and to promote well-performing ones. Key functionalities in online experimentation include the following:

  • Support canary and shadow deployments.
  • Support traffic splitting and A/B tests.
  • Support multi-armed bandit (MAB) tests.
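
Traffic splitting for an A/B test can be sketched with deterministic user bucketing. The 10% candidate share and the logged outcomes below are illustrative assumptions only.

    import hashlib

    # Minimal A/B traffic-splitting sketch: a fixed share of users is routed to the
    # candidate model, and outcomes are recorded per variant for comparison.
    CANDIDATE_SHARE = 0.10   # 10% of serving traffic

    def assign_variant(user_id: str) -> str:
        # Deterministic hashing keeps each user on the same variant across requests.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return "candidate" if bucket < CANDIDATE_SHARE * 100 else "production"

    outcomes = {"candidate": [], "production": []}
    for uid in ["u1", "u2", "u3", "u4", "u5"]:
        variant = assign_variant(uid)
        clicked = False                  # placeholder outcome from the serving log
        outcomes[variant].append(clicked)

    print({variant: len(events) for variant, events in outcomes.items()})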

ML Pipelines (MLP) is a platform for building and deploying portable, scalable ML workflows based on Docker containers.

The model monitoring capability lets you track the efficiency and effectiveness of the deployed models in production to ensure predictive quality and business continuity. This capability informs you if your models are stale and need to be investigated and updated. Key functionalities in model monitoring include the following:

  • Measure model efficiency metrics like latency and serving-resource utilization.
  • Detect data skews, including schema anomalies and data and concept shifts and drifts.
  • Integrate monitoring with the model evaluation capability to continuously assess the predictive effectiveness of the deployed model when ground-truth labels are available.
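
One common way to quantify data drift is the population stability index (PSI). The sketch below computes it for a single numeric feature with NumPy, using synthetic baseline and serving samples; a rule of thumb often flags PSI above roughly 0.2 as drift worth investigating.

    import numpy as np

    # Population Stability Index between a training baseline and recent serving data.
    def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
        edges = np.histogram_bin_edges(baseline, bins=bins)
        base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
        curr_pct = np.histogram(current, bins=edges)[0] / len(current)
        # Floor the proportions to avoid division by zero and log of zero.
        base_pct = np.clip(base_pct, 1e-6, None)
        curr_pct = np.clip(curr_pct, 1e-6, None)
        return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

    rng = np.random.default_rng(0)
    baseline = rng.normal(0.0, 1.0, 10_000)
    recent = rng.normal(0.5, 1.0, 10_000)    # shifted distribution simulates drift
    print(f"PSI = {psi(baseline, recent):.3f}")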

A run is a single execution of a pipeline. Runs comprise an immutable log of all experiments that you attempt, and are designed to be self-contained to allow for reproducibility.

The ML pipelines capability lets you instrument, orchestrate, and automate complex ML training and prediction pipelines in production. ML workflows coordinate different components, where each component performs a specific task in the pipeline. Key functionalities in ML pipelines include the following:

  • Trigger pipelines on demand, on a schedule, or in response to specified events.
  • Enable local interactive execution for debugging during ML development.
  • Integrate with the ML metadata tracking capability to capture pipeline execution parameters and to produce artifacts.
  • Provide a set of built-in components for common ML tasks and also allow custom components.
  • Run on different environments, including local machines and scalable cloud platforms.
  • Optionally, provide GUI-based tools for designing and building pipelines.

A pipeline is a description of an ML workflow, including all of the components in the workflow and how they combine in the form of a graph.
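
A pipeline-as-a-graph can be sketched in plain Python. The component functions and artifacts below are hypothetical stand-ins for the containerized steps that a pipeline platform would schedule, retry, and track.

    # Framework-agnostic sketch: each component consumes the artifacts of its
    # upstream steps, mirroring the graph structure described above.
    def ingest_data():
        return {"rows": 1000}                     # placeholder dataset handle

    def train_model(dataset):
        return {"model": "model-v1", "trained_on": dataset["rows"]}

    def evaluate_model(model):
        return {"accuracy": 0.9}                  # placeholder metric

    PIPELINE = [
        ("ingest", ingest_data, []),
        ("train", train_model, ["ingest"]),
        ("evaluate", evaluate_model, ["train"]),
    ]

    def run_pipeline(steps):
        artifacts = {}
        for name, fn, deps in steps:              # topological order assumed
            inputs = [artifacts[d] for d in deps]
            artifacts[name] = fn(*inputs)
            print(f"step {name} -> {artifacts[name]}")
        return artifacts

    run_pipeline(PIPELINE)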

The model registry capability lets you govern the lifecycle of the ML models in a central repository. This ensures the quality of the production models and enables model discovery. Key functionalities in the model registry include the following:

  • Register, organize, track, and version your trained and deployed ML models.
  • Store model metadata and runtime dependencies for deployability.
  • Maintain model documentation and reporting—for example, using model cards.
  • Integrate with the model evaluation and deployment capability and track online and offline evaluation metrics for the models.
  • Govern the model launching process: review, approve, release, and roll back. These decisions are based on a number of offline performance and fairness metrics and on online experimentation results.
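
A model registry entry can be sketched as a versioned record. The register_model helper, the file-based storage, and the stage names below are hypothetical; a real registry would also enforce the review and approval workflow described above.

    import json
    from datetime import datetime, timezone
    from pathlib import Path

    # Hypothetical minimal registry: stores versions, metadata, and a launch stage
    # per model so release decisions can be reviewed and rolled back.
    REGISTRY = Path("model_registry.json")

    def register_model(name, version, artifact_uri, metrics, stage="staging"):
        registry = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
        registry.setdefault(name, {})[version] = {
            "artifact_uri": artifact_uri,         # where the trained model lives
            "metrics": metrics,                   # offline evaluation results
            "stage": stage,                       # staging -> production -> archived
            "registered_at": datetime.now(timezone.utc).isoformat(),
        }
        REGISTRY.write_text(json.dumps(registry, indent=2))

    register_model(
        name="churn-classifier",
        version="1.3.0",
        artifact_uri="s3://models/churn/1.3.0",   # placeholder URI
        metrics={"auc": 0.87},
    )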

The dataset and feature repository capability lets you unify the definition and the storage of the ML data assets. Having a central repository of fresh, high-quality data assets enables shareability, discoverability, and reusability. The repository also provides data consistency for training and inference. This helps data scientists and ML researchers save time on data preparation and feature engineering, which typically take up a significant amount of their time. Key functionalities in the data and feature repository include the following:

  • Enable shareability, discoverability, reusability, and versioning of data assets.
  • Allow real-time ingestion and low-latency serving for event streaming and online prediction workloads.
  • Allow high-throughput batch ingestion and serving for extract, transform, load (ETL) processes and model training, and for scoring workloads.
  • Enable feature versioning for point-in-time queries.
  • Support various data modalities, including tabular data, images, and text.

ML data assets can be managed at the entity features level or at the full dataset level. For example, a feature repository might contain an entity called customer, which includes features like age group, postal code, and gender. On the other hand, a dataset repository might include a customer churn dataset, which includes features from the customer and product entities, as well as purchase- and web-activity event logs.
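
The customer-entity example above can be sketched as a small feature table with event timestamps. The columns and the point_in_time_lookup helper below are hypothetical, but they show why point-in-time queries matter for training/serving consistency.

    import pandas as pd

    # Illustrative feature layout for a "customer" entity, with timestamps so that
    # point-in-time queries return only values known at training time.
    features = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "event_time": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-02-15"]),
        "age_group": ["26-40", "26-40", "18-25"],
        "postal_code": ["04524", "04524", "13529"],
    })

    def point_in_time_lookup(customer_id: int, as_of: str) -> pd.Series:
        rows = features[
            (features["customer_id"] == customer_id)
            & (features["event_time"] <= pd.Timestamp(as_of))
        ]
        # Latest feature values known as of the requested timestamp.
        return rows.sort_values("event_time").iloc[-1]

    print(point_in_time_lookup(customer_id=1, as_of="2024-02-01"))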

Various types of ML artifacts are produced in different processes of the MLOps lifecycle, including descriptive statistics and data schemas, trained models, and evaluation results. ML metadata is the information about these artifacts, including their location, types, properties, and associations to experiments and runs. The ML metadata and artifact tracking capability is foundational to all other MLOps capabilities. Such a capability enables reproducibility and debugging of complex ML tasks and pipelines. Key functionalities in ML metadata and artifact tracking include the following:

  • Provide traceability and lineage tracking of ML artifacts.
  • Share and track experimentation and pipeline parameter configurations.
  • Store, access, investigate, visualize, download, and archive ML artifacts.
  • Integrate with all other MLOps capabilities.
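
An ML metadata record can be sketched as a small document that links an artifact to the run that produced it, which is the basis for lineage tracking and debugging. The field names and URIs below are hypothetical.

    import json
    import uuid

    # Hypothetical metadata entry tying an artifact to the pipeline run that made it.
    def record_artifact(artifact_type, uri, produced_by_run, properties):
        return {
            "artifact_id": uuid.uuid4().hex,
            "type": artifact_type,               # e.g. "model", "dataset", "metrics"
            "uri": uri,                          # where the artifact is stored
            "produced_by_run": produced_by_run,  # links the artifact to a run
            "properties": properties,
        }

    entry = record_artifact(
        artifact_type="model",
        uri="gs://artifacts/churn/model-v1",     # placeholder URI
        produced_by_run="run-2024-06-01-001",
        properties={"framework": "xgboost", "val_auc": 0.87},
    )
    print(json.dumps(entry, indent=2))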

Subscription term

The subscription is offered on a monthly basis from the specified effective date and renews automatically.


Price

ML PIPELINE™ service unit: KRW 5,000,000 per month (excluding VAT)