Criteria for "production-ready" MLOps

The landscape of MLOps tooling has grown considerably over the last few years. Most emerging tools expand outward from an initial focus point: the one nagging problem they set out to solve. The resulting jungle of opinionated tooling can be overwhelming at times.

The feature set of any MLOps tool can be separated into two categories: fundamentals and productionization. All solutions in this analysis cover the fundamentals, e.g. being agnostic to the actual model architecture or allowing for non-local training. However, the actual value of an MLOps solution lies in the reproducibility of Machine Learning, the seamless transition from experiment to production, and coverage of the ML lifecycle from data to continuous training. The focus here is therefore on productionization features, since these are what distinguish the solutions from one another.

What is "Productionization"?

The abstract term productionization can be broken down into these "harder" criteria:

  • Tracking of input parameters
    (For both preprocessing functions and model training)
  • Comparability between pipelines
    (Use standardized tracking to allow conclusions across pipelines)
  • Caching of pipeline steps
    (Reduce cost and time by reusing processed artifacts)
  • Versioning from data to model
    (Input data, artifacts, preprocessing and model code, trained models)
  • Guaranteed reproducibility
    (Every model training pipeline can be reproduced, always)
  • Cloud Provider integrations
    (Without the need to orchestrate workloads yourself)
  • Modular backends
    (From distributed preprocessing to training and serving)
  • Scalability to large datasets
    (Iterative experiments on huge data through distributed processing and caching)
  • From experiment to deployment
    (Manage every lifecycle stage from early experimentation to continuous training)
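
To make the "Caching of pipeline steps" criterion more concrete, here is a minimal sketch of one common way such a cache can work: a step is only re-executed when its name, its parameters or its input change. This is an illustration under stated assumptions, not how the Core Engine or any other tool in this comparison implements caching. The names `fingerprint`, `run_step` and `scale` are hypothetical, and the in-memory dictionary stands in for a real artifact store.

```python
import hashlib
import json

# Hypothetical in-memory cache; a real system would persist artifacts.
_cache = {}


def fingerprint(step_name, params, input_data):
    """Derive a deterministic key from the step name, its parameters and its input."""
    payload = json.dumps(
        {"step": step_name, "params": params, "input": input_data},
        sort_keys=True,  # key order must not influence the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(step_name, func, params, input_data):
    """Run func only if this exact combination has not been computed before.

    Returns a tuple of (result, cache_hit).
    """
    key = fingerprint(step_name, params, input_data)
    if key in _cache:
        return _cache[key], True
    result = func(input_data, **params)
    _cache[key] = result
    return result, False


def scale(values, factor):
    """Toy preprocessing step used for illustration only."""
    return [v * factor for v in values]


first, hit_first = run_step("scale", scale, {"factor": 2}, [1, 2, 3])
second, hit_second = run_step("scale", scale, {"factor": 2}, [1, 2, 3])
third, hit_third = run_step("scale", scale, {"factor": 3}, [1, 2, 3])
```

The second call reuses the stored result because nothing changed, while the third call runs again because the `factor` parameter differs. The same idea depends on the "Tracking of input parameters" criterion: without recorded parameters there is nothing reliable to key the cache on.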

Feature distinction at a glance

The ecosystem has become a fast-moving pool of interesting companies. While we integrate with "best-in-class" solutions like Seldon Core or Tecton, there are a number of competing solutions out there. We compiled an at-a-glance overview of how the Core Engine stacks up against them.

Disclaimer: All trademarks belong to their respective owners. All information was compiled and last updated in October 2020.

The Core Engine vs. MLFlow

How are we different from MLFlow?

Historically, MLFlow has come from a strong focus on experiment tracking. Over time they've added a model registry and a retroactive packaging approach for experiments (e.g. your git repo contains an MLproject YAML file for conda/docker). The responsibility for orchestration, pipelining code, and integration of MLFlow functionality still lies with you and needs to be baked into your codebase first. It covers aspects of taking ML to production, but not the entire path.

In contrast, we are taking a platform-as-a-service approach to bring ML from experiments to production, reproducibly and continuously. For that we're providing the user experience for the entire ML workflow while taking care of all the "ops-y" aspects:

  • versioning of data,
  • versioning of preprocessing and models,
  • orchestration of ML pipelines (training, batch inference, continuous training, etc.),
  • integrations to various backends for processing, training and serving,
  • pre-generated evaluation notebooks with fully transparent access to all pipeline artifacts

By focussing strongly on a config-driven design and abstracted orchestration, we can guarantee comparability across all training pipelines, built-in caching for faster pipelines, and reusability of all pipeline aspects across the entire team.
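
As a rough illustration of why a config-driven design makes pipelines comparable, consider two runs that are each fully described by a configuration. Comparing them then reduces to comparing the configurations. The sketch below is an assumption-laden example, not the Core Engine's actual configuration schema: the keys (`split`, `trainer`, `epochs`, `lr`) and the helpers `flatten` and `diff_configs` are hypothetical.

```python
def flatten(config, prefix=""):
    """Flatten a nested config dict into dotted keys, e.g. 'trainer.epochs'."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(a, b):
    """Return {key: (value_in_a, value_in_b)} for every setting that differs."""
    flat_a, flat_b = flatten(a), flatten(b)
    keys = sorted(set(flat_a) | set(flat_b))
    return {
        k: (flat_a.get(k), flat_b.get(k))
        for k in keys
        if flat_a.get(k) != flat_b.get(k)
    }


# Hypothetical pipeline configs; only the number of epochs differs.
run_1 = {"split": {"train": 0.8, "eval": 0.2}, "trainer": {"epochs": 10, "lr": 0.001}}
run_2 = {"split": {"train": 0.8, "eval": 0.2}, "trainer": {"epochs": 20, "lr": 0.001}}

changes = diff_configs(run_1, run_2)
```

Here `changes` contains a single entry for `trainer.epochs`, so any difference in results between the two runs can be attributed to that one setting. When pipeline logic lives in free-form code instead of a config, this kind of comparison is much harder to automate.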

The Core Engine vs. Kubeflow

How are we different from Kubeflow?

Kubeflow is a powerful, open-source ML pipelining solution, with a rudimentary ecosystem of integrations (read: Seldon Core and opinionated, Google-created open source complementary projects like MLMetadata for experiment tracking). It runs exclusively on Kubernetes, with no native distributed processing mechanism.

These design decisions require adopters to operate and maintain:

  1. the Kubernetes cluster (to run workloads),
  2. Kubeflow (to orchestrate pipelines),
  3. and, optionally, a distributed compute solution integrated into their pipeline code (e.g. via additional OSS tooling like TFX and Apache Beam)

As good a solution as it already is, Kubeflow suffers from a decidedly unclear direction for the next 12 months. The effort required to achieve reproducible results from data to model is high. Data versioning, comparability of pipelines, caching and platform-independence (e.g. bare metal, Spark) can't be achieved without building complementary systems. Its high degree of freedom in pipeline construction demands stricter diligence from developers and data scientists.

The Core Engine brings a holistic ecosystem of functionality out of the box - without the need to operate more than one platform. As a result, all ML pipelines are reproducible, versioned from data to model and guaranteed deployable to production, without data scientists having to deviate from known-good workflows. The Core Engine takes an integration-based approach to providers (Google Cloud, AWS, Azure, On-Premise), computation backends (e.g. Spark, Kubernetes, bare metal) and best-in-class Machine Learning training environments (Google AI Platform, SageMaker, Azure ML, NVIDIA DGX-1). This gets you started instantly in your own cloud and lets your projects grow more powerful over time as you add providers and backends.

The Core Engine vs. AWS/GCP/Azure

How are we different from the ML products of the big Cloud Providers?

The three big Cloud providers (Amazon, Microsoft and Google) all have a distinct and mature product range for Machine Learning. At a minimum, they provide access to powerful Machine Learning training environments, with a growing number of complementary products (e.g. Google AI Platform Pipelines) assisting with preprocessing of data and subsequent serving of trained Machine Learning models.

Generally speaking, when you build your MLOps infrastructure on top of any Cloud provider's offerings, responsibility for key aspects of your solution shifts to you:

  • versioning
  • tracking
  • concatenating pipeline stages
  • continuous training

On the flip side, the ML offerings usually provide a very stable and mature base environment and allow for deep integration into the existing DevOps landscape in your organisation.

Because we fully understand their potential, the Core Engine is not at odds with them, but rather leverages their power via our integration-based backend system. With a single CLI command you can add any of these training backends to your Core Engine account and benefit from all their advantages - without losing access to any of our features.

All your pipelines will continue to be reproducible, versioned from data to model and cached throughout.

The Core Engine vs. Valohai

How are we different from Valohai's MLOps solution?

One of the more widely known startups in the ecosystem is Valohai. They describe their own offering as having a “heavy focus on machine orchestration”, and sacrifice other convenience functionality to maintain that focus.

Versioning is accomplished through a mix of built-ins and reliance on customer-side versioning of code in a git backend. Getting data from various data sources is a heavily involved process for customers. While they do offer the ability to stream data, sourcing data in general is often the responsibility of clients and happens as the first step of self-assembled pipelines. On that note, Valohai pipelines can be assembled to a high degree of complexity, and they are capable of handling large datasets through a Spark integration.

Pipelines, due to their inherent freedom, are not comparable across experiments, and reproducibility cannot be guaranteed. It's a constantly evolving product, so their feature set will surely improve over time.

The Core Engine vs. Databricks

How are we different from Databricks?

Databricks was founded by the original creators of Apache Spark, which they started all the way back in 2009 at UC Berkeley. Their commercial offering is by now a big player in the market, predominantly focussed on larger organizations and enterprises.

The key focus lies on less technically versed data scientists by providing layers of abstraction on top of the powerful parallel and distributed computation that Spark provides. AutoML and one-click solutions for common problems (like deploying models and autoscaling deployments) in the ML lifecycle are the main selling points.

As a managed platform, Databricks falls short on the portability of workloads in your existing cloud projects. Integrations and partial offerings on AWS and Azure are emerging, but do not support the full product range yet.

The list of supported data sources is continually expanding, but versioning of data, and therefore reproducibility of pipelines, still falls solely to you, the customer. Caching is possible due to the underlying Spark architecture, but this in turn ties users to Spark and rules out other processing backends.