Learning What to Say to Your VLA: Mostly Harmless VLA Steering

Overview

Vision-Language-Action (VLA) models let natural language serve as a flexible test-time interface to robot control. Language is appealing because it operates at a higher level of abstraction than low-level motor commands: if a frozen VLA already contains useful low-level skills, changing the language input can elicit and compose those skills without overwriting the underlying action policy. In principle, this yields better sample complexity and out-of-distribution generalization than directly fine-tuning the VLA.

In practice, though, a VLA's language-to-action mapping is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, the open-vocabulary search space is combinatorially large, and some behaviors may not be steerable at all. We propose a framework that interactively searches for language sequences that improve closed-loop VLA performance, distills them into a test-time language feedback policy (LFP), and learns an improvement head that predicts when steering will help. We then conformalize the improvement head so the LFP steers only when it is reliable and otherwise falls back to the base instruction. Operating on arbitrary frozen VLAs — with no access to the original training data and no fine-tuning of the model — our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% on hardware, with strong harmlessness guarantees under visual and semantic perturbations.

Steer at a higher abstraction

Interactive language steering generalizes better than direct VLA fine-tuning across visual and semantic perturbations and novel behavior compositions, with better sample complexity — matching VLA fine-tuning using as little as one-fifth of the on-policy data.

Know when not to steer

A conformalized improvement head refuses to steer when language is unlikely to help, preventing performance degradation out-of-distribution without hurting in-distribution success, with a provable false-positive-rate guarantee.

Closed-loop recovery

Because feedback is issued closed-loop, the policy discovers recovery behaviors — re-eliciting existing low-level skills at the right time — that are not observed with open-loop prompt rephrasing.

Method

We instantiate our approach in three phases.

Results

We evaluate in simulation (LIBERO-OOD) and on hardware (Franka Emika), asking: (Q1) how does language steering compare to the base VLA and direct VLA fine-tuning in in-distribution performance, out-of-distribution robustness, and sample complexity? (Q2) can conformal calibration prevent harmful steering? (Q3) is closed-loop feedback necessary, or is open-loop prompt search enough? Our LFP π^RFT (with refusal) and its calibrated variant π^CP are compared against the base VLA (π^VLA), direct VLA fine-tuning (π^VLA-SFT), the off-the-shelf VLM (π^Base), and the narrated SFT policy (π^SFT).

Qualitative Hardware Rollouts

Each pair shows a base VLA failure (left) and a language-steered VLA success (right) for one hardware task and perturbation condition.

[Video pair: Base VLA (failure) vs. Language Steered VLA (success)]

Closed-Loop Recovery

Closed-loop steering provides robustness to adversarial mid-task perturbations. After the robot correctly places the green cube, a person moves it back into the scene. The base VLA continues from the original instruction and misplaces the cube, while π^RFT observes the changed state, emits an updated language action ℓ_t (“Put the green block in the blue bin”), and steers the frozen VLA to recover — a behavior not seen with open-loop prompting. (Playback 2×.)
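The recovery behavior described above can be pictured as a simple control loop in which the language action ℓ_t is re-issued from the current observation at every step. The following is a toy sketch under stated assumptions, not the authors' implementation: `feedback_policy`, `frozen_vla`, `run_closed_loop`, and the dictionary observation format are hypothetical stand-ins introduced only for illustration.

```python
from typing import Callable, Dict, List

Observation = Dict[str, bool]  # hypothetical observation format

TASK = "Put the green cube in the blue basket and the red cube in the orange basket."


def feedback_policy(obs: Observation, task_instruction: str) -> str:
    """Hypothetical stand-in for the LFP: pick a language action from the current state."""
    if not obs["green_cube_in_blue_bin"]:
        return "Put the green block in the blue bin"
    return task_instruction


def frozen_vla(obs: Observation, instruction: str) -> str:
    """Hypothetical stand-in for the frozen VLA: it only reports the instruction it received."""
    return f"act on: {instruction}"


def run_closed_loop(
    observations: List[Observation],
    task_instruction: str,
    lfp: Callable[[Observation, str], str],
    vla: Callable[[Observation, str], str],
) -> List[str]:
    """Re-query the language policy at every step, then pass its output to the VLA."""
    actions = []
    for obs in observations:
        language_action = lfp(obs, task_instruction)
        actions.append(vla(obs, language_action))
    return actions


# Toy trace: cube not placed, cube placed, then a person moves the cube back out.
trace = [
    {"green_cube_in_blue_bin": False},
    {"green_cube_in_blue_bin": True},
    {"green_cube_in_blue_bin": False},
]
actions = run_closed_loop(trace, TASK, feedback_policy, frozen_vla)
```

In this toy trace the third step receives the corrective instruction again because the policy looks at the changed state, whereas an open-loop prompt would be fixed before the perturbation happens.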

Base VLA vs. Language Steered VLA

Instruction (ℓ^task): "Put the green cube in the blue basket and the red cube in the orange basket."


Simulation (LIBERO-OOD)

In simulation we steer a π_0.5 VLA fine-tuned on LIBERO. We report mean success rate with standard error, pooled over 200 visual×semantic perturbation combinations.
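For readers who want to reproduce this kind of summary, a minimal sketch of a success rate with its standard error is shown here. It assumes each episode is a binary success or failure and uses the binomial standard error sqrt(p(1 - p)/n); the page does not state how its standard errors were computed, so this is one common choice rather than the authors' procedure. The function name and the counts are illustrative only.

```python
import math
from typing import Tuple


def success_rate_with_se(successes: int, episodes: int) -> Tuple[float, float]:
    """Return (success rate, binomial standard error) for binary episode outcomes."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    if not 0 <= successes <= episodes:
        raise ValueError("successes must lie between 0 and episodes")
    p = successes / episodes
    se = math.sqrt(p * (1.0 - p) / episodes)
    return p, se


# Illustrative counts only, not taken from the experiments.
rate, se = success_rate_with_se(successes=750, episodes=1000)
```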

Baselines — training vs. deployment success rate

Each point is a method: x = language-policy training success, y = held-out deployment success (up & right is better). Bars are standard error. n = 2,500 (train) / 7,500 (deploy) episodes.

Each method is shown in the LFP training environment vs. held-out deployment environments. The narrated policy π^SFT improves training success but degrades deployment success, so narration alone is not enough for robustness. Interactive search and calibration recover the lost robustness: π^RFT and π^CP outperform every baseline, while direct fine-tuning (π^VLA-SFT) helps only in training-like conditions. Absolute success is higher in deployment because LIBERO-10 is near-saturated for the base VLA.

On-Policy Rollout Usage

Success rate vs. successful on-policy rollouts per task. n = 10,000 episodes.

A favorable exchange rate for language steering: π^RFT trained on only 10 successful rollouts already matches or exceeds π^VLA-SFT trained on 50 — one-fifth the fine-tuning data. π^VLA-SFT plateaus by 50 rollouts while π^RFT keeps improving, since adapting the language interface uses data more efficiently than fine-tuning the action policy.

Novel Behavior Composition

Unseen compositions of learned behaviors (Compose). n = 300 episodes.

π^RFT steers the frozen VLA to solve unseen Compose tasks (novel combinations of learned behaviors), whereas direct VLA fine-tuning slightly degrades over the base VLA.

Hardware (Franka Emika)

On hardware we steer a π_0.5 VLA zero-shot (no fine-tuning) across four tabletop tasks and two novel tasks, each with visual (VOOD) and semantic (SOOD) perturbations.

Success rate (%) over 30 trials per condition.

π^RFT improves over the base VLA on the steerable tasks (CubeSort, CubeMug, MarkerBlock). On the less-steerable Microwave, uncalibrated steering can hurt under perturbation, but π^CP refuses harmful interventions and recovers performance above the base VLA. On the two novel tasks, π^RFT transfers zero-shot, beating both the base VLA and π^SFT.

When (Not) to Steer: Calibrated Refusal

Refusal is not only a safety mechanism — it also improves task performance by steering selectively. Conformal calibration controls the rate of harmful steering at a chosen target α.
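Selective steering can be sketched as a gate on the improvement head's output: steer when the score clears a threshold, otherwise fall back to the base instruction. This is a minimal sketch, assuming the improvement head returns a scalar score where higher means steering is more likely to help; `select_instruction` and its parameters are hypothetical names, and the threshold values are illustrative.

```python
def select_instruction(
    task_instruction: str,
    proposed_instruction: str,
    improvement_score: float,
    threshold: float,
) -> str:
    """Return the steered instruction only if the improvement score exceeds the threshold."""
    if improvement_score > threshold:
        return proposed_instruction
    # Refusal: keep the base instruction unchanged.
    return task_instruction


base = "Put the green cube in the blue basket."
proposed = "Put the green block in the blue bin"
steered = select_instruction(base, proposed, improvement_score=0.95, threshold=0.90)
refused = select_instruction(base, proposed, improvement_score=0.40, threshold=0.90)
```

Under this sketch, calibration only changes how `threshold` is chosen; the gate itself stays the same.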

Refusal improves success while cutting harmful steering

Simulation, 10,000 pooled rollouts across 200 visual & semantic perturbation combinations.

Refusing to steer when language is harmful improves success. Without refusal, π^NR reaches 70.93%; adding refusal (π^RFT) raises it to 74.96%, and conformal calibration (π^CP) to 75.96%. Calibration also cuts the false-positive rate (harmful steering) from 38.92% to 9.31% at the α = 0.10 target.

Calibration tracks the target false-positive rate

Reliability diagram: empirical FPR vs. target α, averaged over 5 randomized calibration/test splits.

As we sweep the target α, the empirical FPR tracks the diagonal (FPR = α), confirming the class-conditional conformal guarantee that harmful steering is bounded by α. The uncalibrated policy instead sits far above the diagonal at a fixed 38.92% FPR.
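One standard way to obtain a bound of this form is split conformal calibration restricted to the harmful class: collect improvement-head scores on held-out cases where steering did not help, and set the steering threshold at a finite-sample-corrected quantile of those scores. The sketch below illustrates that generic recipe under the assumption that calibration and test cases are exchangeable and that the policy steers only when the score exceeds the threshold. It is not necessarily the authors' exact procedure, and the function names and calibration scores are hypothetical.

```python
import math
from typing import List


def conformal_threshold(harmful_scores: List[float], alpha: float) -> float:
    """Threshold that a new harmful case exceeds with probability at most alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    n = len(harmful_scores)
    # Finite-sample conformal rank; round() guards against float error in the product.
    k = math.ceil(round((n + 1) * (1.0 - alpha), 9))
    if k > n:
        return math.inf  # too few calibration points for this alpha: never steer
    return sorted(harmful_scores)[k - 1]


def empirical_fpr(harmful_scores: List[float], threshold: float) -> float:
    """Fraction of harmful cases on which the policy would still steer."""
    if not harmful_scores:
        raise ValueError("need at least one harmful case")
    steered = sum(1 for s in harmful_scores if s > threshold)
    return steered / len(harmful_scores)


# Illustrative calibration scores only: 0.01, 0.02, ..., 0.99.
calibration = [i / 100 for i in range(1, 100)]
tau = conformal_threshold(calibration, alpha=0.10)
fpr = empirical_fpr(calibration, tau)
```

Sweeping `alpha` and plotting `empirical_fpr` on a held-out split against `alpha` gives a reliability diagram of the kind described above.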

Confusion matrix of π^RFT's language interventions (calibration at α = 0.10)
Setting	Method	Accuracy ↑	TPR ↑	FPR ↓	TNR ↑	FNR ↓	Refusal Rate
Simulation	π^RFT	66.22%	70.18%	38.92%	61.08%	29.82%	43.42%
Simulation	π^CP	77.74%	67.77%	9.31%	90.69%	32.23%	57.66%
Hardware	π^RFT	84.72%	100.00%	61.11%	38.89%	0.00%	9.72%
Hardware	π^CP	99.17%	99.63%	2.22%	97.78%	0.37%	24.72%
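The columns of this table can be computed from four counts, as sketched below. The sketch assumes a "positive" is a case where steering helps and a "predicted positive" is a decision to steer, so that a refusal is a predicted negative. That reading is an assumption, although it is consistent with the rows above, where TPR + FNR and FPR + TNR each sum to 100%. The function name and the counts are illustrative, not experimental data.

```python
from typing import Dict


def intervention_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Rates for steer/refuse decisions; tp = steered when steering helps, fp = steered when harmful."""
    positives = tp + fn  # cases where steering helps
    negatives = fp + tn  # cases where steering is harmful
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one helpful and one harmful case")
    total = positives + negatives
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / positives,
        "fpr": fp / negatives,
        "tnr": tn / negatives,
        "fnr": fn / positives,
        "refusal_rate": (tn + fn) / total,  # fraction of cases where the policy did not steer
    }


# Illustrative counts only.
metrics = intervention_metrics(tp=70, fp=20, tn=80, fn=30)
```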

Search ablation (simulation): trajectory-level closed-loop search wins
Search	Success Rate (%)	Refusal (%)
Open-loop (static prompt edit)	71.3 ± 0.5	43.5
Closed-loop, phrase-level	70.4 ± 0.5	45.9
Closed-loop, trajectory-level (Ours)	75.0 ± 0.4	43.4

BibTeX

@article{jeong2026mostlyharmless,
  title   = {Learning What to Say to Your VLA: Mostly Harmless Vision-Language-Action Model Steering},
  author  = {Jeong, Hyun Joe and Swamy, Gokul and Bajcsy, Andrea},
  journal = {Under review},
  year    = {2026}
}