Serverless GPU computing with Modal (for custom models)
Are you developing custom Machine Learning models? Are you sick of heating your house with your GPU? Do you like comfortable DevOps and ML Ops tooling? Well-specified setups and concise Serverless function hosting?
Modal is the new kid on the block. I am not an affiliate or anything, just a random user with special interests. That being said, I found it relatively simple to port my local environment over to Modal. I didn’t want AWS or Azure GPU instances because the price calculation is horrible, and I didn’t want Google Colab because it constantly disconnects (even in the Pro version). ML dev is grindy, and good tools make a difference. Modal is one of them, and they also have a fantastic GPU glossary.
Serverless for Agentic AI?
To use Serverless computing for Agentic AI… wait. Let’s get into Silly Valley marketing mode: “To use Service as a Software with Serverless backends”… To achieve that, you may need to get a little tricky. Firstly, the ML Ops part is challenging because you need to pass GPU instances to container runtimes. Then you have to define all your ML dependencies… and then you have to do the math. Costs here, costs there. Timing. Quality. Prediction accuracy. Etc. etc.
Modal makes the ML Ops / DevOps work easy when it comes to GPUs. I ported my Byte Latent Transformer (Meta research code) environment to it in about an hour. From a hacky Anaconda, Pip and Mamba system to a well-specified container.
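To give a sense of what “well-specified” means here, this is roughly how a container image is declared in Modal. The Python version and package pins below are illustrative placeholders, not my actual BLT dependency list:

```python
import modal

# A reproducible container spec instead of a hand-grown Anaconda/Pip/Mamba mix.
# The pinned packages are placeholders, not the real BLT requirements.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("torch==2.3.0", "numpy==1.26.4", "sentencepiece==0.2.0")
)
```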
Modal executes a Python function that is automatically containerized. The function processes the data and applies the ML logic as defined in code.
You apply a couple of decorators, pre-define your environment, and then run the commands:
```
modal deploy .\modal_blt_app.py
modal run modal_blt_app.py::process_blt_embeddings
```
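For reference, here is a minimal sketch of what such an app file can look like. The GPU type, timeout, dependencies, and function body are my placeholder assumptions, not the actual BLT code:

```python
# modal_blt_app.py -- sketch matching the two commands above
import modal

# Placeholder image; in practice this lists the full BLT dependency set.
image = modal.Image.debian_slim(python_version="3.11").pip_install("torch")

app = modal.App("blt-embeddings", image=image)

@app.function(gpu="A10G", timeout=600)  # GPU type and timeout are assumptions
def process_blt_embeddings(text: str = "hello byte latent world"):
    # Runs inside the container on the attached GPU. Placeholder body:
    # the real function would load Meta's BLT model and encode the
    # input bytes into embeddings.
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"encoding {len(text.encode())} bytes on {device}")

@app.local_entrypoint()
def main():
    # Entry point for a plain `modal run modal_blt_app.py`; the ::function
    # form above targets process_blt_embeddings directly instead.
    process_blt_embeddings.remote()
```

The difference between the two commands: modal deploy registers the app as a persistent deployment, while modal run spins up an ephemeral run of a single function and tears everything down afterwards.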
Summary
As indicated in my previous blog post, I wanted to accelerate the generation of embeddings. Mission success. In my opinion, the important part here is the supply chain. I find it hard to truly own an ML workflow end to end: OpenAI here, Modal there, AWS this and that. It’s important to keep supply chains simple and costs under direct control. Besides that, offloaded Serverless workflows like this make it possible to use powerful Open-Source research like Meta’s Byte Latent Transformer implementation. Getting this as a Service from some random startup can be tricky.