nexus

command module
v1.1.8
Published: Oct 6, 2025 License: Apache-2.0 Imports: 10 Imported by: 0

README


Introduction

Nexus is a lightweight, Kubernetes-native, high-throughput client-server proxy for machine learning, optimization and AI algorithms. Nexus allows data science teams to author and deploy their products to live environments with a few YAML manifests, and provides a simple, flexible HTTP API for interacting with machine learning applications.

Core Nexus features include:

  • Elastic virtual queues for incoming requests
    • Incoming request rate normalization via throttling to limits acceptable to the Kubernetes API Server
    • Automatic scheduling of delayed requests ahead of the queue
  • Allocation to autoscaling machine groups using Kubernetes affinity/toleration settings
  • Plug-n-play algorithm execution and deployment via Kubernetes Custom Resources
  • Multi-cluster support without extra configuration with the help of Nexus Configuration Controller
  • Input buffering and push down to algorithm container using secure URIs
  • Processing rates from request receipt to result on the scale of thousands of completions per second, depending on algorithm execution time
  • HTTP API for launching runs and receiving results via presigned URLs (see the example below)
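
To illustrate the API shape, a run submission might look like the sketch below. The host, algorithm name and payload fields are placeholders, not the actual contract; only the /algorithm/v1/run URI suffix is documented (see Versioning).

# Illustrative only: host, algorithm name and payload are placeholders.
curl -X POST "https://nexus.example.com/algorithm/v1/run/my-algorithm" \
  -H "Content-Type: application/json" \
  -d '{"parameters": {"input_uri": "s3a://my-bucket/inputs/batch-001.json"}}'
# The response is assumed to carry a request identifier; results are then
# delivered via a presigned URL, as described in the feature list above.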

Design

Nexus can be deployed in Single Cluster or Multi Cluster modes. Single Cluster mode consists of three components: at least one scheduler, a supervisor and at least one receiver, all deployed to a single Kubernetes cluster. Multi Cluster mode consists of:

  • Controller Cluster, which has at least one scheduler
  • One or more Shard Clusters, with at least one receiver and a supervisor.
    • A scheduler can also be deployed to these clusters if an algorithm uses the Nexus SDK to create execution trees

Supervisor

The supervisor handles the following scenarios:

  • Requests that were delayed due to a scheduler instance shutdown: these are picked up by the supervisor and submitted to the target cluster
  • Misconfigured requests that could not be converted to a Kubernetes Job for any reason
  • Garbage collection of failed submissions
  • State accounting and garbage collection of submissions with container launch issues such as ImagePullBackOff, or runtime failures such as OOMKill
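
For example, the kinds of launch failures the supervisor accounts for can be inspected manually with a jsonpath query like the one below; the nexus namespace is an assumption, so adjust it to wherever your workloads run.

# List pods whose containers are stuck waiting, e.g. in ImagePullBackOff:
kubectl get pods -n nexus \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
  | grep -E 'ImagePullBackOff|ErrImagePull'
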
Scheduler

The scheduler is what makes it possible to run algorithms through Nexus. Each scheduler has a public API that can be used to submit runs and retrieve results. Moreover, each scheduler holds a separate virtual queue that it uses to process incoming requests. Nexus relies on a load balancer using a round-robin algorithm to distribute requests between scheduler pods, so a horizontal autoscaler should be used for production deployments.
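
For example, a minimal horizontal autoscaler can be created imperatively as sketched below; the deployment name nexus-scheduler is a placeholder, so check the actual name created by the chart.

# Placeholder deployment name; verify with: kubectl get deploy -n nexus
kubectl autoscale deployment nexus-scheduler -n nexus \
  --cpu-percent=70 --min=2 --max=10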

Quickstart

  1. Deploy Nexus Custom Resource Definitions:
helm install nexus-crd --namespace nexus --create-namespace oci://ghcr.io/sneaksanddata/helm/nexus-crd --version v1.0.0

Make sure to always install the latest CRD version. The v1 API is guaranteed to be backwards compatible with all v1.* CRD releases; however, latest is not.
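
To confirm the CRDs are in place, you can list them; the filter below assumes the CRD names contain "nexus", so adjust it if they differ.

kubectl get crds | grep -i nexus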

  2. We currently do not ship backend dependencies as subcharts, so make sure you have the following up and running:
  • An Apache Cassandra or similar cluster
  • An S3-compatible object storage
  • (optional) a Kubernetes cluster for running workloads

For Cassandra you can use AstraDB, Scylla or native Cassandra; however, you'll need to use the Scylla config to configure it. For S3, make sure you have the following:

  • a dedicated bucket
  • an IAM policy for the service account Nexus will use. The required permissions are listed below (see the policy sketch after this list):
    • "s3:ListBucket"
    • "s3:GetObject"
    • "s3:PutObject"
    • "s3:DeleteObject"
    • "s3:DeleteObjectVersion"
    • "s3:*MultipartUpload"
    • "s3:*Parts"
  3. Create the secrets containing configuration for the Cassandra, S3 and Kubernetes connections. You can find examples for Cassandra and S3 in the helm values. In single-cluster mode, you do not need to provide external Kubernetes configs. In multi-cluster mode, export your worker clusters' kubeconfigs and create the following secret in both the controller and the worker clusters. For example, suppose you have a worker cluster called nexus-worker-cluster with a kubeconfig file config.json containing:
{
  "apiVersion": "v1",
  "clusters": [
    {
      "cluster": {
        "certificate-authority-data": "...",
        "server": "https://..."
      },
      "name": "nexus-worker-cluster"
    }
  ],
  "contexts": [
    {
      "context": {
        "cluster": "nexus-worker-cluster",
        "user": "user"
      },
      "name": "nexus-worker-cluster"
    }
  ],
  "current-context": "nexus-worker-cluster",
  "kind": "Config",
  "preferences": {},
  "users": [
    {
      "name": "user",
      "user": {
        "exec": {
          "apiVersion": "client.authentication.k8s.io/v1beta1",
          "args": [],
          "command": "get-token",
          "env": [],
          "interactiveMode": "IfAvailable",
          "provideClusterInfo": false
        }
      }
    }
  ]
}

You can then create a secret:

kubectl create secret generic nexus-workers \
    --from-file=nexus-worker-cluster=./config.json

Now you are ready to install the scheduler:

helm install nexus --namespace nexus --create-namespace oci://ghcr.io/sneaksanddata/helm/nexus \
  --set scheduler.config.s3Buffer.payloadStoragePath=s3a://nexus-s3-bucket/payload-store \
  --set scheduler.config.s3Buffer.s3Credentials.secretName=nexus-s3 \
  --set scheduler.config.cqlStore.type=scylla \
  --set scheduler.config.cqlStore.secretName=nexus-cassandra \
  --set ginMode=release
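
Once the release is installed, a quick sanity check is to make sure the scheduler pods reach Running; the exact pod names depend on the chart's naming conventions.

kubectl get pods -n nexus
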
Versioning

Nexus's API is versioned. Requests against a specific version will have a URI suffix like this: /algorithm/v1/run. If a version is not specified, the latest API will be targeted, which might include experimental and unstable features. For production, always use a stable API version such as /algorithm/v1/run. Most changes tested under latest will eventually be integrated into v1. When the next major release v2 comes along, v1 will remain supported until v3 is released. Feature requests must be tagged with the API version they target; currently, that is v1 only.

API Management

Any new API paths must be reflected in the Swagger docs, even though the app doesn't serve Swagger. Update the generated docs:

./swag init --parseDependency --parseInternal -g main.go

This is required for the API clients (Go and Python) to be updated correctly. Note that until Swag 2.0 is released, the OpenAPI v3 model must be updated using the Swagger converter.
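
One way to do the conversion is the public converter service, as sketched below; this assumes swag writes its output to docs/swagger.json, so point the path at the actual generated file.

# Convert the generated Swagger 2.0 spec to an OpenAPI v3 document.
curl -s -X POST "https://converter.swagger.io/api/convert" \
  -H "Content-Type: application/json" \
  -d @docs/swagger.json > docs/openapi_v3.json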


