Kavya G

Colocating ML Inference and Training with Fast GPU Memory Handover

ATC '25 paper

I read a large number of papers in the area of GPU sharing, enough to build a taxonomy of sorts (topic for a later blog post!).

Intro and Motivation

Only some of this section is from the paper itself; the rest is mostly my notes on compute sharing from different papers.

Compute sharing

The motivation for compute sharing is quite simple - GPUs are a very precious resource and are sometimes underutilized, which is, simply put, not very optimal.

Compute sharing (of a GPU) can be thought of as two types: temporal and spatial. Picture a bar graph where the x-axis represents time and the y-axis represents GPU resources. In temporal sharing, each bar is exactly one job, which may or may not use the GPU's full resources. In spatial sharing, each bar represents one or more jobs.

Temporal sharing allows sharing of the GPU across different time slices. It has slightly larger overheads due to context-switching time; in particular, switching large amounts of in-GPU context memory can add significant overhead.

Spatial sharing avoids these problems by avoiding context switching and has the benefit of high utilization.

Why would someone do temporal sharing, then? Because temporal sharing is the default when two processes execute simultaneously on an Nvidia GPU. There is a good paper (GPREEMPT) which actually uses temporal sharing quite effectively.

Spatial sharing requires modifying the jobs, which is usually harder. Further problems arise when we co-run jobs: we need to prevent interference between them, through either static partitioning of resources or dynamic management. Over-using a resource like memory could lead to both jobs crashing.

This paper does spatial sharing.

Co-locating jobs

When doing spatial sharing, we further have the choice of which jobs to co-locate. There are several papers with different combinations; this paper co-locates a training job with an inference job, a popular pairing that matches the bursty nature of inference with the long-running nature of training (other combinations include inference-inference and training-training).

The paper does, in fact, use some characteristics of training jobs when co-locating (batch-wise execution, the memory profile of training jobs, etc.).

The paper - Sirius

The paper focuses on memory management when doing spatial sharing. Co-running multiple jobs simultaneously means both jobs' contexts must be loaded in memory, which may OOM. One way of handling that problem is to shift overflowing memory to host memory (a Unified Virtual Memory approach), though moving data between GPU and CPU is costly.

The paper does dynamic memory management, leveraging the fact that inference jobs have dynamic memory profiles and that training jobs are elastic enough to fit in whatever memory remains. In their setting, inference jobs are first-class citizens and get priority, while training jobs are left to fend for themselves with the remaining resources.

Memory profile of training jobs

An interesting subsection of the paper where they analyze and present the memory profile of training jobs, since those are the jobs which need to adjust. It has some interesting tidbits.

  1. Memory during training consists of two components: static training state (model parameters and optimizer state), which occupies a fixed amount of memory, and intermediate results, which are freed after the backward pass.
  2. They find that intermediate results take up to 90% of the total memory (this is for general ML workloads; I believe LLM workloads will have different profiles), and this can be tuned with the batch size (a quick way to see this is sketched right after this list).
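
To make point 2 concrete, here is a minimal sketch (mine, not the paper's) using PyTorch's memory counters; the model choice and sizes are arbitrary placeholders:

```python
import torch
import torchvision

# Static state: model parameters (optimizer state, e.g. Adam moments, would add
# to this but is allocated lazily on the first step; plain SGD keeps none).
model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

static_mb = torch.cuda.memory_allocated() / 2**20
print(f"static state: ~{static_mb:.0f} MiB")

for batch_size in (16, 32, 64):
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    y = torch.randint(0, 1000, (batch_size,), device="cuda")
    loss = criterion(model(x), y)
    loss.backward()                          # activations are freed during/after this
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    print(f"batch {batch_size}: peak ~{peak_mb:.0f} MiB "
          f"(~{peak_mb - static_mb:.0f} MiB over the static state)")
```

Shrinking the batch shrinks that peak, which is exactly the knob a co-located training job can turn.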

”"”However, the memory caching mechanism of ML frameworks (PyTorch), which reduces the memory allocation overhead to improve training performance, leads to the unchanged memory occupation of training tasks even when the batch size is reduced. This causes the unused memory of training tasks to be invisible to inference tasks.”””

Architecture

Sirius introduces a memory handover interface: Require(M) and Release(M). When the inference task needs more memory, it calls Require(M), and it calls Release(M) when done.
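
To get a feel for the interface, here is a hypothetical sketch of a broker exposing Require/Release; none of the names or details below come from the paper, and the real system hands over GPU memory through a shared pool rather than doing Python-level bookkeeping:

```python
import threading

class MemoryBroker:
    """Hypothetical sketch of the Require/Release handover idea: the inference
    side asks for M bytes; the training side registers a callback that frees
    memory (e.g. by discarding the in-flight batch and shrinking the batch size)."""

    def __init__(self, total_bytes):
        self.lock = threading.Lock()
        self.free_bytes = total_bytes
        self.reclaim_cb = None              # set by the training engine

    def register_reclaimer(self, callback):
        self.reclaim_cb = callback          # callback(bytes_needed) -> bytes_freed

    def require(self, m):
        """Called by the inference task before a burst of requests."""
        with self.lock:
            if self.free_bytes < m and self.reclaim_cb is not None:
                self.free_bytes += self.reclaim_cb(m - self.free_bytes)
            if self.free_bytes < m:
                raise MemoryError("handover could not free enough memory")
            self.free_bytes -= m

    def release(self, m):
        """Called by the inference task once the burst has been served."""
        with self.lock:
            self.free_bytes += m
```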

For this, it uses a shared memory pool between the two tasks. This apparently bypasses the memory caching mechanism they mentioned before (How???).

They mention using an inference engine with an MLaaS interface but do not say how much they customized it to add the Require and Release APIs. The training engine is PyTorch-based, with modifications to update the batch size based on available memory.
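
As a rough illustration of that training-side modification (my own sketch, with made-up names and a placeholder heuristic), the loop could size each batch from whatever memory the broker above currently reports as free:

```python
def pick_batch_size(free_bytes, bytes_per_sample, max_batch=256):
    """Crude heuristic: size the next batch to fit in the memory currently left
    over by inference. bytes_per_sample would come from profiling the model's
    per-sample footprint; every number here is a placeholder."""
    return max(1, min(max_batch, free_bytes // bytes_per_sample))

def training_loop(broker, sample_batch, step_fn, bytes_per_sample):
    """sample_batch(n) and step_fn(batch) stand in for the data loader and the
    usual forward/backward/optimizer step; both are hypothetical."""
    while True:
        bs = pick_batch_size(broker.free_bytes, bytes_per_sample)
        step_fn(sample_batch(bs))
```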

Compute Sharing

For compute sharing, they achieve dynamic GPU SM sharing, basically using the SM mask technique from a prior paper which introduces libsmctrl. From a quick read, it seems that the library requires tasks to share a context (a common memory address space), which means modifying them to run as multiple threads of a single process. In their words, it “leverages undocumented hardware capabilities to enforce partitions of TPC. Unfortunately, like earlier mechanisms, libsmctrl requires merging tasks into the same context to concurrently execute them, compromising logical isolation and transparency.”
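
For reference, here is roughly what using the SM-mask approach could look like from Python. I am assuming the libsmctrl API described in the prior paper (a libsmctrl_set_stream_mask function taking a stream handle and a 64-bit TPC bitmask, where a set bit disables that TPC for the stream), so treat this as an unverified sketch:

```python
import ctypes
import torch

# Assumption: libsmctrl has been built from the prior paper's repo and is on the
# library path. Both streams live in one CUDA context (one process), which is
# exactly the limitation the Sirius authors point out.
smctrl = ctypes.CDLL("libsmctrl.so")

infer_stream = torch.cuda.Stream()   # stands in for the inference task's work
train_stream = torch.cuda.Stream()   # stands in for the training task's work

# Split an (assumed) 32-TPC GPU in half; the mask disables the TPCs whose bits are set.
NUM_TPCS = 32
lower_half = (1 << (NUM_TPCS // 2)) - 1            # bits for TPCs 0..15
upper_half = ((1 << NUM_TPCS) - 1) ^ lower_half    # bits for TPCs 16..31

# Inference runs on the lower half (upper half disabled), training on the upper half.
smctrl.libsmctrl_set_stream_mask(ctypes.c_void_p(infer_stream.cuda_stream),
                                 ctypes.c_uint64(upper_half))
smctrl.libsmctrl_set_stream_mask(ctypes.c_void_p(train_stream.cuda_stream),
                                 ctypes.c_uint64(lower_half))
```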

The authors of the libsmctrl paper have a newer paper which I might cover later in this series. It introduces an updated library (nvtaskset) which solves some of the problems the prior work had.

Icing on the cake

The above system sounds solid to me: co-running inference and training, with the inference system giving hints about its memory requirements and the training system adjusting its training parameters (batch size) based on those hints. But this is an ATC paper, so the authors are not done yet.

Training batches take a few hundred milliseconds to complete. Waiting for the current batch to finish before releasing memory is not feasible, since inference job SLOs are within a few milliseconds.

The paper proposes instantly discarding a batch in order to quickly service the inference job's request for memory (this is where the high priority of inference jobs and the best-effort status of training jobs comes into play). This discard can be done as long as the training job is not yet updating the model parameters.
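
Conceptually, the discard is just a flag checked before the parameter update; here is a toy version (my own, with a hypothetical flag that the memory broker would set when the inference task calls Require):

```python
import threading

discard_requested = threading.Event()        # set by the memory broker on Require()

def train_one_batch(model, optimizer, criterion, x, y):
    """Run one batch, but throw the work away if inference asked for memory before
    we reached the parameter update. Discarding is safe at any point before
    optimizer.step(): the parameters have not been touched, so the batch can
    simply be re-run later."""
    loss = criterion(model(x), y)
    loss.backward()
    if discard_requested.is_set():
        optimizer.zero_grad(set_to_none=True)    # drop gradients, freeing their memory
        discard_requested.clear()
        return None                              # batch discarded, no update
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```

Of course, by the time the flag is checked, the backward pass has already launched its kernels asynchronously, which is exactly the problem the next paragraph deals with.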

But there is a problem: asynchronous execution. When executing jobs on the GPU, kernels are launched from the CPU ahead of the GPU actually running them. So even if the batch is cut off (discarded), previously launched kernels will still run and only release their memory after completion, which adds overhead. To solve this, Sirius does software GPU kernel management: it keeps a software queue of kernels that acts as a holding buffer, without actually scheduling the kernels onto the GPU.
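
A toy version of that holding buffer, purely to capture the control flow (Sirius does this at the CUDA kernel-launch level, not in Python):

```python
import queue
import threading

class DeferredLaunchQueue:
    """Kernel launches are recorded as closures instead of being submitted to the
    GPU immediately, so pending work can be dropped on an abort without waiting
    for the GPU to drain. All names here are mine, not the paper's."""

    def __init__(self):
        self.pending = queue.Queue()
        self.aborted = threading.Event()

    def submit(self, launch_fn):
        # launch_fn is a no-argument callable that performs the real kernel launch.
        self.pending.put(launch_fn)

    def drain(self):
        # Called by a dispatcher: actually launch queued kernels, unless an
        # abort (instant batch discard) was requested in the meantime.
        while not self.pending.empty():
            launch_fn = self.pending.get()
            if self.aborted.is_set():
                continue                     # drop the launch instead of sending it to the GPU
            launch_fn()

    def abort(self):
        self.aborted.set()
```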

On top of this, they also support multi-GPU training (only DDP, apparently, not model or pipeline parallelism). The kernel launches of communication primitives add more complexity: Sirius aborts NCCL primitive launches without aborting the network connection itself, which would take a long time to re-establish.

Key Takeaways

This paper sits well within the compute sharing space. Like a few others in this space, their solution requires some low-level tinkering and is quite non-obvious; the handling of asynchronous execution is the biggest example of this, along with the multi-GPU management (communication primitive launches). Their compute sharing technique has opened up some references that I need to dig deeper into.

  1. This paper doesn't treat the two jobs equally: inference jobs are high priority while training jobs are left as best effort. This may not always be the desired setup and could result in training jobs running forever if they never get priority. They do include some handling so that the training job doesn't thrash for memory.
  2. Most of their evaluation uses DNNs, though they do have one experiment with LLMs, albeit a rather small training LLM (Qwen 0.5B), so I'm not sure how exactly their findings and technique will extrapolate to larger models.