Chuck Tang

1. Introduction
Models today progress in discontinuous jumps in capability. To improve a model, a team must collect data, train, and ship a new version. This takes months, and each release reaches downstream users as an abrupt change in behavior that can be either remarkable or catastrophic.
At Trajectory, we aim to build the platform for continual learning!
A continual learning system should improve hourly, as user interactions enable a model to acquire new capabilities on the fly. Under this paradigm:
A coding agent can learn new engineering patterns as developers correct its work.
A personal assistant can improve planning and prioritization as the user reschedules.
A support agent can resolve hard tickets as operators intervene on difficult cases.
Most training infrastructure still assumes a linear lifecycle: allocate GPUs, initialize the model, run a job, spin down, and then repeat.
Continual learning revises this relationship.
When production interactions are taken as training inputs, training becomes part of a live system. The infrastructure that emerges looks more like a distributed service than a collection of independent training jobs.
2. Status Quo
Most modern RL training infrastructure can be reduced to three core primitives:
Sampler. Generates trajectories from the current policy model.
Trainer. Computes gradients and updates the policy model weights from those trajectories.
Parameter synchronization. Updated weights are broadcast back to the inference workers so future rollouts can use the latest policy.

Traditional RL training stacks run this loop as a standalone job where GPU workers are allocated to a single experiment and the loop runs repeatedly.
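To make the three primitives concrete, here is a minimal sketch of the loop as a standalone job. It is a toy under stated assumptions: the `Policy`, `Sampler`, `Trainer`, and `sync_weights` names are hypothetical, the "policy" is a single scalar, and a trajectory is just a noisy target value tagged with the policy version that produced it.

```python
import random
from dataclasses import dataclass


@dataclass
class Policy:
    """Toy policy: one scalar weight plus a version counter."""
    weight: float = 0.0
    version: int = 0


class Sampler:
    """Generates trajectories from its local copy of the policy."""

    def __init__(self) -> None:
        self.policy = Policy()

    def rollout(self, n, rng):
        # Each toy trajectory records which policy version produced it.
        return [(self.policy.version, 1.0 + rng.gauss(0.0, 0.1)) for _ in range(n)]


class Trainer:
    """Computes gradients and updates the policy weights."""

    def __init__(self, lr=0.5):
        self.policy = Policy()
        self.lr = lr

    def step(self, trajectories):
        # Gradient of 0.5 * (weight - target)^2, averaged over the batch.
        grad = sum(self.policy.weight - t for _, t in trajectories) / len(trajectories)
        self.policy.weight -= self.lr * grad
        self.policy.version += 1


def sync_weights(trainer, sampler):
    """Parameter synchronization: copy the trainer's weights to the sampler."""
    sampler.policy = Policy(trainer.policy.weight, trainer.policy.version)


def run_job(steps, seed=0):
    """One standalone job: sample, train, sync, repeated `steps` times."""
    rng = random.Random(seed)
    sampler, trainer = Sampler(), Trainer()
    versions_seen = []
    for _ in range(steps):
        batch = sampler.rollout(8, rng)      # 1. sample from the current policy
        versions_seen.append(batch[0][0])
        trainer.step(batch)                  # 2. update weights
        sync_weights(trainer, sampler)       # 3. broadcast to the sampler
    return sampler.policy, versions_seen
```

Because the loop is synchronous, the rollouts at step k always come from policy version k. Everything inside `run_job` is created and discarded per job, which is the lifecycle the sections below take issue with.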

This training paradigm has four key inefficiencies that our warm, multi-LoRA training stack aims to improve upon.
2.1. Slow, Cold Starts
Every experiment in the serial-job paradigm requires a full restart of the training stack. A restart consists of checkpoint reloads, distributed runtime initialization, and inference engine warmups. For large models, this step alone can exceed 30 minutes per run, throttling iteration speed.
A warm, distributed engine lets us initialize the model once; a persistent training service removes the need for per-job spin up times.
2.2. Large, Memory Intensive RL
Frontier models often exceed 100B parameters. Their weights, gradients, and optimizer state must all fit in GPU memory for RL training and sampling, making the hardware cost prohibitively high for many teams to start with. Qwen3.5-397B, for example, can require up to eight H200 nodes to fit into memory.
LoRA training cuts memory usage by an order of magnitude compared to full fine-tuning. It freezes the base model so that only a small set of adapter weights, gradients, and optimizer states flow through the training stack.
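A back-of-the-envelope calculation shows where the savings come from. The byte counts below are assumptions, not measurements from our stack: BF16 weights and gradients, FP32 master weights plus two FP32 Adam moments per trainable parameter, and a hypothetical adapter sized at 0.5% of the base model. Activations and KV cache are ignored.

```python
def training_bytes(trainable_params: float, frozen_params: float = 0.0) -> float:
    """Rough GPU bytes for weights, gradients, and Adam state.

    Assumed per trainable parameter: 2 (BF16 weight) + 2 (BF16 gradient)
    + 12 (FP32 master weight and two FP32 Adam moments) = 16 bytes.
    Frozen parameters only cost their 2-byte BF16 weight.
    """
    per_trainable = 2 + 2 + 12
    per_frozen = 2
    return trainable_params * per_trainable + frozen_params * per_frozen


base = 100e9               # a 100B-parameter base model
adapter = 0.005 * base     # hypothetical adapter size (0.5% of the base)

full_ft = training_bytes(trainable_params=base)
lora = training_bytes(trainable_params=adapter, frozen_params=base)

print(f"full fine-tuning: {full_ft / 1e12:.2f} TB")
print(f"LoRA:             {lora / 1e12:.2f} TB")
print(f"reduction:        {full_ft / lora:.1f}x")
```

Under these assumptions the frozen base weights dominate the LoRA footprint, and the reduction comes out a little under 8×. The exact factor depends on precision choices and adapter rank.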
2.3. Single-Tenant Training
Traditional RL stacks need to spin up a full set of nodes for every job and can only run a single-tenant experiment at a time.
Multi-LoRA training breaks the one-job-per-GPU-set paradigm by mapping each experiment to a dedicated LoRA adapter, thereby multiplexing N experiments onto one set of GPUs.
2.4. Low Job Utilization
In synchronous RL training, the trainer stalls while waiting for the inference engine to finish and the inference engine stalls while waiting for the trainer. Async RL is used to address this performance penalty by running the trainer and generator concurrently; however, this introduces off-policy drift and ties together throughput optimizations for training and inference. This problem is compounded in agentic, tool-calling workloads, where the inference engine may sit idle during long tool calls. Furthermore, as the model learns to call tools correctly, the inference workload can shift dynamically, rendering the pre-determined, static load balancing between trainer and generator ineffective.
Multi-LoRA adds a new parallelism knob for improving training utilization by letting you load balance across jobs over time, rather than relying on static, intra-job throughput tuning within a single trainer-generator process group. In our experiments, multi-LoRA training dramatically improved inference throughput for underutilized RL training jobs.
3. Continuous Multi-LoRA Training (C-LoRA)

Together with the Berkeley Sky Lab and Anyscale team, we built an RL training stack for continual learning workloads. In the diagram above, we see how multi-tenant, always-hot training can run three concurrent experiments in less time than a single-tenant training framework takes to train two serial jobs end to end.
3.1. Architecture

3.1.1 Inference
Inference is where most of the multi-LoRA concurrency gains come from. In vLLM, all adapters stay hot-loaded in GPU memory so that decode steps can mix tokens from different adapters in the same batch. The key enabler is the SGMV decode kernel, which fuses per-adapter matrix-vector work so that multiple LoRAs share one GPU launch per decode step instead of each LoRA decoding in isolation.
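The math behind a mixed-adapter decode step can be sketched in a few lines. This is not the SGMV kernel, only a pure-Python illustration of what it computes: one shared base projection for the whole batch, plus a low-rank update selected per row by adapter id. The function names and the list-of-lists matrix layout are hypothetical.

```python
from collections import defaultdict

Vector = list[float]
Matrix = list[list[float]]


def vecmat(x: Vector, m: Matrix) -> Vector:
    """Row vector times matrix."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


def mixed_batch_decode(
    xs: list[Vector],
    adapter_ids: list[int],
    base_w: Matrix,
    adapters: dict[int, tuple[Matrix, Matrix]],
) -> list[Vector]:
    """Compute y = x @ W + (x @ A_k) @ B_k, where k is each row's adapter."""
    # Shared base projection, computed once for every row in the batch.
    out = [vecmat(x, base_w) for x in xs]

    # Segment rows by adapter so each adapter's (A, B) pair is visited once.
    segments: dict[int, list[int]] = defaultdict(list)
    for row, k in enumerate(adapter_ids):
        segments[k].append(row)

    for k, rows in segments.items():
        a, b = adapters[k]
        for row in rows:
            delta = vecmat(vecmat(xs[row], a), b)   # rank-r update for this row
            out[row] = [p + q for p, q in zip(out[row], delta)]
    return out


# Three rows from two tenants decode in one batch.
base_w = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    0: ([[1.0], [0.0]], [[0.0, 1.0]]),   # rank-1 adapter for tenant 0
    1: ([[0.0], [1.0]], [[2.0, 0.0]]),   # rank-1 adapter for tenant 1
}
ys = mixed_batch_decode([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0, 1, 0], base_w, adapters)
```

A GPU kernel performs the per-segment work inside a single launch rather than in a Python loop, which is where the batching benefit comes from.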
3.1.2 Training
Training is serialized across tenants: one active LoRA adapter trains on the GPU at a time while the remaining adapters sit in pinned CPU memory.
Each tenant's state lives in an AdapterStore, which holds:
LoRA parameters
FP32 master weights
optimizer moments
gradient buffers
The training engine swaps one tenant’s LoRA state from pinned CPU memory onto the GPU, runs a single forward_backward pass on that tenant’s batched inputs, then swaps it back off so the next tenant can train. Note: this training path is still single-adapter, so the multi-LoRA concurrency gains we see in inference do not yet apply to training.
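The swap-train-swap cycle described above can be sketched as follows. This is an illustration rather than the SkyRL implementation: the class layout, method names, and the toy `forward_backward` are assumptions, and device transfers are modeled as a string flag instead of real host-to-device copies.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AdapterState:
    """Per-tenant training state, mirroring the four fields listed above."""
    lora_params: list[float]
    master_weights: list[float]       # FP32 copy in a real system
    optimizer_moments: list[float]
    grad_buffers: list[float]
    device: str = "cpu"               # "cpu" stands for pinned host memory


@dataclass
class AdapterStore:
    """Hypothetical store: every tenant lives on CPU, at most one on GPU."""
    states: dict[str, AdapterState] = field(default_factory=dict)
    active: Optional[str] = None

    def register(self, tenant: str, size: int) -> None:
        self.states[tenant] = AdapterState(
            [0.0] * size, [0.0] * size, [0.0] * size, [0.0] * size
        )

    def swap_in(self, tenant: str) -> AdapterState:
        if self.active is not None:
            raise RuntimeError(f"{self.active} is still on the GPU")
        state = self.states[tenant]
        state.device = "gpu"          # stands in for a host-to-device copy
        self.active = tenant
        return state

    def swap_out(self, tenant: str) -> None:
        self.states[tenant].device = "cpu"   # stands in for a device-to-host copy
        self.active = None


def forward_backward(state: AdapterState, batch: list[float]) -> None:
    """Toy stand-in for one forward_backward pass on a tenant's batch."""
    target = sum(batch) / len(batch)
    state.grad_buffers = [w - target for w in state.master_weights]


def train_round(store: AdapterStore, batches: dict[str, list[float]]) -> list[str]:
    """Train each tenant once, with a single adapter on the GPU at a time."""
    order = []
    for tenant, batch in batches.items():
        state = store.swap_in(tenant)
        try:
            forward_backward(state, batch)
        finally:
            store.swap_out(tenant)    # always free the GPU slot for the next tenant
        order.append(tenant)
    return order
```

The `swap_in` guard makes the single-adapter constraint explicit: a second tenant cannot enter until the first has been swapped out.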
3.1.3 Weight Synchronization
After each optimization step, updated LoRA weights are loaded in-place into the inference engine. The scheduler does not freeze and the base model stays resident, so other tenants keep decoding while weights continuously update. The result behaves more like a continuously running distributed service sending small LoRA weights than a collection of independent training jobs loading large models.
3.2 Comparison
The system described above scaled to eight concurrent multi-LoRA runs and achieved 3.14× higher end-to-end experiment throughput when run concurrently versus serially. Here is a summary of the key design choices that enable these experiment-throughput gains.
| Dimension | Single-tenant RL | Multi-LoRA RL |
|---|---|---|
| Job Startup | Full restart every job: process groups, checkpoint load, inference engine warmup | Adapter attaches to a persistent, warm training engine |
| Memory Cost | Weights + gradients + optimizer state for full model on GPU | Frozen large base model weights + small adapter weights, gradients, and optimizer states |
| Experiment Throughput | One large model can monopolize the entire cluster | N adapters share one set of nodes; throughput multiplies by N |
| GPU Utilization | Imbalanced trainer/generator; off-policy async RL; idle GPUs during tool calls; rollouts and training run in lockstep in sync RL | Cross-job load balancing multiplexes rollouts and fills idle capacity dynamically |
| Scaling Ceiling | Number of nodes | Adapter density on base model |
4. Experiments
4.1 Set Up
We tested multi-LoRA training on a single H200 node with Qwen3-4B-Instruct-2507, running sync RL on GSM8K in an agentic setting. To do so, we reframed GSM8K as a tool-use learning task where the model must decide when to invoke custom tools like calculator and how to structure its response through a final_answer tool call. This makes the environment substantially richer than the standard one-shot benchmark and provides a stronger training signal to learn from. The policy starts with no knowledge of how to use tools and sits at around 40% accuracy at step 0, compared to the 90%+ you'd see on standard GSM8K. With the right learning algorithm, the model climbs to 90%+ by the end of step 9.
For a fair comparison, we swept the runtime configuration over nine baseline runs, selecting for low variance and fast runtimes in the multi-LoRA case, and used the following setup:
Hardware: 1× H200 node, four inference + four training GPUs
Inference engine: vLLM, tensor_parallel_size=4, max_loras=16, max_cpu_loras=16, max_num_seqs=128
Model: Qwen3-4B-Instruct-2507
Benchmark: GSM8K With Tools
RL config: 10 sync-RL steps
Tools: Calculator, Final Answer
Reward: 1.0 only if the model calls the Final Answer tool with the correct answer.
Mode: Agentic
Average Steps (Tool Calls): 5
Number of Concurrent, Multi-LoRA Runs: 1, 2, 4, 8
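The reward rule in the setup above is sparse and binary. A minimal sketch of such a reward function follows. The `ToolCall` structure, the numeric parsing, and the tolerance are assumptions for illustration; only the tool names and the "1.0 only on a correct Final Answer call" rule come from the setup.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str        # e.g. "calculator" or "final_answer"
    argument: str    # raw argument string produced by the model


def reward(tool_calls: list[ToolCall], expected: float, tol: float = 1e-6) -> float:
    """Return 1.0 only when final_answer is called with the correct value."""
    finals = [call for call in tool_calls if call.name == "final_answer"]
    if not finals:
        return 0.0                    # answered in free text, or never finished
    try:
        value = float(finals[-1].argument.replace(",", ""))
    except ValueError:
        return 0.0                    # malformed argument
    return 1.0 if abs(value - expected) <= tol else 0.0


episode = [ToolCall("calculator", "6 * 7"), ToolCall("final_answer", "42")]
score = reward(episode, expected=42.0)
```

A policy that reasons correctly but replies in plain text scores zero here, which is consistent with the untrained model starting well below its standard GSM8K accuracy.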
Additionally, we swept all single-tenant configs to obtain an optimized serial RL baseline. This serial baseline has the same settings as the multi-LoRA run above except:
Inference engine: vLLM, tensor_parallel_size=8, max_loras=1, max_cpu_loras=1, max_num_seqs=1024
Serial Baseline Wall Clock Time: 2077s
Number of Serial, Single-Tenant Runs: 1
4.2 Metrics
There are many metrics to track when designing a cluster-wide training system. A researcher running a batch of experiments wants to know when all the experiment sweeps are in. Someone debugging a single hero run cares about individual step times. An infra team member may want to know how the average experiment behaves under load.
To capture this, we compare a batch of N concurrent multi-LoRA trainers against the same N runs executed serially, using four metrics:
Total Experiment Time. When is the whole batch of experiments done?
Mean Experiment Time. How long does a single experiment take on average?
First Experiment Time. When does the first experiment finish?
Step Time. How long does a single training step take inside an experiment?
The first two measure how a sweep, or cluster of experiments, performs as a whole. The last two measure how individual experiments are affected by the multi-LoRA training algorithm.
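The three batch-level metrics can be computed from per-experiment finish times. The sketch below assumes every finish time is measured from the start of the batch, so a serial run that waits in the queue is charged for its waiting time. That reading is our interpretation of the definitions above, and the helper names are hypothetical. Step Time is measured inside a run and is not derived here.

```python
from statistics import mean


def batch_metrics(finish_times: list[float]) -> dict[str, float]:
    """Summarize a batch from each experiment's wall-clock finish time (seconds)."""
    return {
        "total": max(finish_times),   # when the whole batch is done
        "mean": mean(finish_times),   # average time until an experiment finishes
        "first": min(finish_times),   # when the first result arrives
    }


def serial_finish_times(n: int, run_seconds: float) -> list[float]:
    """In a serial queue, run k finishes after k back-to-back runs."""
    return [k * run_seconds for k in range(1, n + 1)]


# Eight serial runs at the 2077 s baseline from Section 4.1.
serial = batch_metrics(serial_finish_times(8, 2077.0))
print(serial)
```

For the serial queue, the mean finish time is 4.5 runs' worth of wall clock, which is why a concurrent batch can improve the mean even when each individual run is slower.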
4.3 Speedups
We first compare eight concurrent multi-LoRA experiments with eight serial, single-tenant experiments. As shown below, the concurrent batch of eight experiments finished in less wall-clock time than three serial experiments run back-to-back, highlighting a 3.14× speedup in total experiment time.
The first concurrent experiment finishes 2.39× slower than the first serial run, but mean experiment time improves: each concurrent experiment finishes 1.94× faster than the average serial experiment.

4.4 Scaling LoRAs
We swept from one to eight concurrent jobs to see how the speedup scales with the number of LoRAs. Total Experiment Time grows linearly for the serial baseline but sub-linearly for multi-LoRA across N=2, 4, and 8, with the gap widest at eight jobs. This makes sense. In the serial baseline, GPUs remain underutilized during every tool call while the model waits for a response from the environment. In addition, at the end of each batch of rollouts, the inference GPUs sit idle again for the entire duration of the synchronous training step. Multi-LoRA fills these gaps with work from other jobs.

| N | Speedup in Total Experiment Time | Speedup in Mean Experiment Time |
|---|---|---|
| 2 | 1.84× | 1.40× |
| 4 | 2.68× | 1.73× |
| 8 | 3.14× | 1.94× |
To rule out run-to-run noise, we repeated the comparison twice with results agreeing within five percent at every concurrency level.
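The reported speedups can be turned into rough wall-clock figures. The numbers this sketch prints are implied values, not measurements: it assumes the total speedup is the serial batch time divided by the concurrent batch time, and that N serial runs take N × 2077 s.

```python
SERIAL_RUN_SECONDS = 2077.0   # serial baseline wall-clock time from Section 4.1

# Reported speedups in Total Experiment Time, keyed by concurrency N.
total_speedup = {2: 1.84, 4: 2.68, 8: 3.14}


def implied_batch_seconds(n: int, speedup: float) -> float:
    """Concurrent batch time implied by speedup = serial_total / concurrent_total."""
    return n * SERIAL_RUN_SECONDS / speedup


for n, speedup in total_speedup.items():
    seconds = implied_batch_seconds(n, speedup)
    # n / speedup expresses the batch time in units of one serial run.
    print(f"N={n}: ~{seconds:.0f} s, about {n / speedup:.2f} serial runs")
```

At N=8 the batch costs roughly 2.55 serial runs of wall clock, which matches the earlier observation that eight concurrent experiments finish in less time than three serial ones.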
4.5 Tradeoffs
Multi-LoRA improves overall experiment throughput, but it increases per-step latency. As N, the number of experiments, grows, two metrics degrade relative to the serial baseline: First Experiment Time (how long the fastest experiment takes to finish) and Step Time (how long each RL step takes).

| N | First Experiment Time | Step Time |
|---|---|---|
| 2 | 0.96× (4% slower) | 213 s/step (1.04× slower) |
| 4 | 0.79× (27% slower) | 295 s/step (1.44× slower) |
| 8 | 0.59× (70% slower) | 490 s/step (2.39× slower) |
Step time grows sub-linearly with N as inference load increases, a sign that the engine is becoming more saturated. The tradeoff is that individual experiments wait longer. Specifically, at N=8 experiments, the first experiment finishes 70% slower in the concurrent, multi-LoRA setting in comparison to serial, single-tenant training.
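The First Experiment Time column mixes two notations, a speed ratio and a percent slowdown. The conversion between them is sketched below, assuming the ratio is serial time divided by concurrent time, so that values under 1.0 mean the concurrent run is slower.

```python
def percent_slower(speed_ratio: float) -> float:
    """Convert a speed ratio (serial time / concurrent time) to percent slower."""
    return (1.0 / speed_ratio - 1.0) * 100.0


for ratio in (0.96, 0.79, 0.59):
    print(f"{ratio:.2f}x -> {percent_slower(ratio):.1f}% slower")
```

The results land within about a percentage point of the table's figures. The small differences are what you would expect from rounding the ratios to two decimals.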
4.6 Accuracy Checks
To validate training stability, we confirmed that multi-LoRA matches the serial baseline accuracy throughout training. In the graph below, we observe that the concurrent training run reward_accuracy tracks the serial baseline within ±1σ throughout all steps in training. Most importantly, all concurrent trainers reach reward_accuracy > 0.9 by step 9 at every concurrency level (N=1,2,4,8) we tested.

5. Getting Started
One of the goals of this work is to make multi-LoRA RL training infrastructure accessible outside a small number of large internal teams. Here is a simple way to get started on an 8× H100/H200 setup using open-source libraries from SkyRL and the Tinker cookbook.
The setup below uses:
4 GPUs for Megatron training
4 GPUs for vLLM rollout generation
4 LoRA slots for concurrent, GSM8K training
Qwen3-4B-Thinking-2507 as the base model
5.1 Launch a SkyRL Multi-LoRA Endpoint
First, start a multi-LoRA, always-hot training endpoint. To avoid CUDA version mismatches, we recommend running these commands inside the Docker image novaskyai/skyrl-train-ray-2.51.1-py3.12-cu12.8:
5.2 Run Concurrent RL Experiments
Next, launch four RL jobs against the same endpoint. Each job trains its own LoRA adapter on the GSM8K task and stops after 10 RL training steps.
5.3 What to Expect
A working SkyRL endpoint validated with curl -fsS http://127.0.0.1:8000/api/v1/healthz
Four independent log directories appearing under /tmp/tinker-examples/multilora-rl-{0,1,2,3}
Four concurrent GSM8K RL runs sharing one warm vLLM + Megatron backend and creating 4 separate LoRA adapters
Four experiments with accuracy that rises from X → Y in N minutes which is roughly the same time it would take to train 1 LoRA adapter serially!
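If you script the launch, it can help to wait for the health endpoint before submitting jobs. The helper below is a sketch using only the Python standard library. The function name, retry counts, and the assumption that a healthy endpoint answers with HTTP 200 are ours; the URL is the one from the curl check above.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://127.0.0.1:8000/api/v1/healthz"   # same URL as the curl check


def wait_until_healthy(url=HEALTH_URL, attempts=30, delay=2.0,
                       opener=urllib.request.urlopen):
    """Poll `url` until it answers with HTTP 200; return True on success."""
    for _ in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass                      # endpoint not up yet; retry after a pause
        time.sleep(delay)
    return False
```

Call `wait_until_healthy()` once after starting the endpoint and launch the RL jobs only if it returns True. The `opener` parameter exists so the function can be exercised without a running server.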
6. Future Work
We see the system described above as an early design rather than a finished architecture. There are several directions we are excited to explore in the near future:
Higher adapter concurrency: we have scaled to eight concurrent LoRA experiments. Next, can we push the number of LoRA adapters even higher?
Larger models: so far, we have tested primarily on mid-sized models like Qwen3-4B and Nemotron-30B. Next, can we extend this framework to frontier-scale models with trillions of parameters?
Training-side multiplexing: today, most of the throughput gain comes from inference, while training remains serialized across tenants. Next, can we bring vLLM-style multiplexing into training so that training optimization can run concurrently as well?
7. Closing Thoughts
At Trajectory, we are building the platform for continual learning.
We expect models to improve quickly through tight train-deploy loops that seamlessly incorporate feedback, evaluations, and new behaviors. As that happens, the boundary between training and production systems will begin to fade.
Our goal with this work is to make that shift more accessible through open-source, concurrent, always-hot RL training infrastructure.
If you are excited about building such systems, we’d love to hear from you at hello@trajectory.ai.
8. Acknowledgments
We’d like to thank the SkyRL team: Eric Tang, Charlie Ruan, Philipp Moritz, and Sumanth Hegde for their work in building out core components of the Multi-LoRA training stack in SkyRL.
We’d like to thank the Google Cloud team for their partnership and guidance in shaping the Kubernetes orchestration of our training platform.
We’d like to thank Dian Ang Yap, Jerry Chan, Ronak Malde, Michael Elabd, Hersh Godse, Arjun Karanam, Neil Kale, Irene Han, and Albert Li for their contributions to the technical portions of this blog post.
9. Additional Experiments
We ran the same multi-LoRA, concurrent experiments vs serial baselines on the long-context τ-bench retail task (with custom tools) using the MoE model NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.
At N=2, multi-LoRA finishes 10 steps of training 1.28× faster than running the jobs serially end to end. The tradeoff remains: each tenant’s step time increases by 1.57× because the inference engine now serves a heavier workload across multiple LoRA adapters and training still processes each tenant’s batch one at a time.

Caption: On τ-bench retail with Nemotron 30B MoE, Multi-LoRA reduces total wall time for the two-job batch, while regressing on average step time
These results match what we saw on smaller models (Qwen3-4B) and simpler benchmarks (GSM8K), giving us confidence to scale this approach to larger models and more demanding workloads next.



