Manually deploy a serverless inference endpoint

Scalable AI model inference is handled by the AI4OS Inference platform, powered by the OSCAR open-source serverless platform.

An OSCAR cluster consists of, among other components:

  • a Kubernetes cluster that can optionally auto-scale the number of nodes within certain boundaries.

  • MinIO, a high-performance object storage system, configured so that file uploads to a MinIO bucket can trigger the invocation of an OSCAR service to perform AI model inference.

  • Knative, a FaaS platform, configured so that synchronous requests to an OSCAR service are handled by dynamically provisioned pods (containers) in the Kubernetes cluster.

The AI4OS Inference platform consists of a pre-deployed OSCAR cluster, accessible exclusively to fully authenticated users.

Different OSCAR clusters are available depending on the project you belong to.

You can also launch services via the command-line interface (CLI).

Warning

This cluster is provided for testing purposes and OSCAR services may be removed at any time depending on the underlying infrastructure capacity and usage rates. Should this happen, you can easily re-deploy the services from the corresponding FDL file.

1. Configuring an OSCAR service

The cluster is used to deploy OSCAR services, each described by a Functions Definition Language (FDL) file that specifies, among other features (a minimal sketch follows this list):

  • The Docker image, which includes the AI model supporting the DEEPaaS API, together with all the libraries and data required to perform the inference.

  • The computing requirements (CPUs, RAM, GPUs, etc.).

  • The shell script executed, for each service invocation, inside the container created from the Docker image.

  • (Optional) The link to a MinIO bucket and an input folder.
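
Below is a minimal sketch of how such an FDL file could be generated with Python and PyYAML. The field names follow the OSCAR FDL schema, but the cluster identifier, service name, Docker image, script and bucket paths are illustrative placeholders rather than values from the AI4OS platform; adapt them to your deployment and check them against the current FDL documentation.

    import yaml  # pip install pyyaml

    # Illustrative FDL definition: the cluster id, service name, image, script
    # and bucket paths below are placeholders, not real AI4OS values.
    service = {
        "functions": {
            "oscar": [
                {
                    "my-cluster": {                     # placeholder cluster identifier
                        "name": "plants-classification",  # placeholder service name
                        "memory": "2Gi",
                        "cpu": "1.0",
                        # Docker image bundling the model, the DEEPaaS API and its dependencies
                        "image": "registry.example.org/ai4os/plants-classification:latest",
                        # shell script executed inside the container on each invocation
                        "script": "script.sh",
                        # optional: MinIO folders linked to the service
                        "input": [
                            {"storage_provider": "minio.default",
                             "path": "plants-classification/input"}
                        ],
                        "output": [
                            {"storage_provider": "minio.default",
                             "path": "plants-classification/output"}
                        ],
                    }
                }
            ]
        }
    }

    # Write the FDL file, ready to be deployed on the cluster.
    with open("service.yaml", "w") as fdl:
        yaml.safe_dump(service, fdl, sort_keys=False)

Keeping the resulting service.yaml under version control also makes it easy to re-deploy the service if it is removed (see the warning above).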

2. Invoking an OSCAR service

OSCAR services can be invoked in several ways (see Invoking services for further details; sketches of the first two modes follow this list):

  • Asynchronously, by uploading files to a MinIO bucket; each upload triggers the OSCAR service.

  • Synchronously, by invoking the service from the OSCAR CLI or via the OSCAR Manager’s REST API. A certain number of pre-deployed containers can be kept up and running to mitigate the cold start problem (the initial delay of the first invocations of the service).

  • Through Exposed Services, intended for stateless services created from large containers that take too long to start for each individual invocation. This is the case when supporting fast inference of pre-trained AI models that require close to real-time processing with high throughput. In a traditional serverless approach, the AI model weights would be loaded into memory for each service invocation (each one creating a new container). With exposed services, the AI model weights can be loaded just once and the service performs the AI model inference for every subsequent request. An auto-scaled, load-balanced approach is supported for these stateless services.
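
The following sketch illustrates the first two invocation modes from Python, assuming a service named plants-classification (a placeholder) has already been deployed. The OSCAR endpoint, MinIO endpoint, credentials, bucket and file names are placeholders; the /run/<service> path and the base64-encoded payload follow OSCAR's synchronous-invocation documentation, but verify both against the OSCAR Manager version running on your cluster.

    import base64
    import requests          # pip install requests
    from minio import Minio  # pip install minio

    OSCAR_ENDPOINT = "https://inference.example.org"  # placeholder OSCAR Manager endpoint
    SERVICE = "plants-classification"                 # placeholder service name
    TOKEN = "<access-token>"                          # placeholder OIDC / service token

    # Synchronous invocation via the OSCAR Manager REST API: the input file is
    # sent base64-encoded in the request body and the response carries the result.
    with open("leaf.jpg", "rb") as image:
        payload = base64.b64encode(image.read())

    response = requests.post(
        f"{OSCAR_ENDPOINT}/run/{SERVICE}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        data=payload,
        timeout=300,
    )
    response.raise_for_status()
    print("synchronous result:", response.text)

    # Asynchronous invocation: uploading a file to the MinIO input folder linked
    # to the service triggers a new job; the result appears later in the
    # configured output folder.
    minio_client = Minio(
        "minio.example.org",        # placeholder MinIO endpoint of the cluster
        access_key="<access-key>",
        secret_key="<secret-key>",
        secure=True,
    )
    minio_client.fput_object(
        bucket_name="plants-classification",   # bucket linked to the service
        object_name="input/leaf.jpg",          # input folder configured in the FDL
        file_path="leaf.jpg",
    )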

3. More info and examples