Learning What to Say to Your VLA:
Mostly Harmless Vision-Language-Action Model Steering

We learn a language feedback policy that steers a frozen VLA — deciding what to say and, crucially, when not to say it.

Hyun Joe Jeong, Gokul Swamy, Andrea Bajcsy

Robotics Institute, Carnegie Mellon University

Under review, 2026

Overview

Vision-Language-Action (VLA) models let natural language serve as a flexible test-time interface to robot control. Language is appealing because it operates at a higher level of abstraction than low-level motor commands: if a frozen VLA already contains useful low-level skills, changing the language input can elicit and compose those skills without overwriting the underlying action policy. In principle, this yields better sample complexity and out-of-distribution generalization than directly fine-tuning the VLA.

In practice, though, a VLA's language-to-action mapping is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, the open-vocabulary search space is combinatorially large, and some behaviors may not be steerable at all. We propose a framework that interactively searches for language sequences that improve closed-loop VLA performance, distills them into a test-time language feedback policy (LFP), and learns an improvement head that predicts when steering will help. We then conformalize the improvement head so the LFP steers only when it is reliable and otherwise falls back to the base instruction. Operating on arbitrary frozen VLAs — with no access to the original training data and no fine-tuning of the model — our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% on hardware, with strong harmlessness guarantees under visual and semantic perturbations.
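The steer-or-fall-back decision described above can be sketched as a small gate around the language feedback policy. This is a minimal illustration, not the authors' implementation: `GatedLanguagePolicy`, `propose`, `improvement_score`, and `threshold` are hypothetical names, and the sketch assumes the improvement head returns a scalar where higher means steering is more likely to help.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class GatedLanguagePolicy:
    """Steer only when the improvement score clears a calibrated threshold.

    All three fields are illustrative assumptions:
      propose           -- stands in for the LFP: (observation, base instruction) -> candidate instruction
      improvement_score -- stands in for the improvement head: higher = steering more likely to help
      threshold         -- assumed to come from a separate calibration step
    """

    propose: Callable[[Any, str], str]
    improvement_score: Callable[[Any, str], float]
    threshold: float

    def instruction(self, observation: Any, base_instruction: str) -> Tuple[str, bool]:
        """Return the instruction to send to the frozen VLA and whether we steered."""
        candidate = self.propose(observation, base_instruction)
        score = self.improvement_score(observation, candidate)
        if score > self.threshold:
            return candidate, True       # steer the frozen VLA
        return base_instruction, False   # refuse and fall back to the base instruction
```

Calling `instruction` at every control step (rather than once per episode) is what would make the feedback closed-loop; the frozen VLA itself is never modified in this sketch.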

Steer at a higher abstraction
Interactive language steering generalizes better than direct VLA fine-tuning across visual and semantic perturbations and novel behavior compositions, with better sample complexity — matching VLA fine-tuning using as little as one-fifth of the on-policy data.
Know when not to steer
A conformalized improvement head refuses to steer when language is unlikely to help, preventing performance degradation out-of-distribution without hurting in-distribution success, with a provable false-positive-rate guarantee.
Closed-loop recovery
Because feedback is issued closed-loop, the policy discovers recovery behaviors — re-eliciting existing low-level skills at the right time — that are not observed with open-loop prompt rephrasing.

Method

We instantiate our approach in three phases.

Method phase figure

Results

We evaluate in simulation (LIBERO-OOD) and on hardware (Franka Emika), asking: (Q1) how does language steering compare to the base VLA and direct VLA fine-tuning in in-distribution performance, out-of-distribution robustness, and sample complexity? (Q2) can conformal calibration prevent harmful steering? (Q3) is closed-loop feedback necessary, or is open-loop prompt search enough? Our LFP πRFT (with refusal) and its calibrated variant πCP are compared against the base VLA (πVLA), direct VLA fine-tuning (πVLA-SFT), the off-the-shelf VLM (πBase), and the narrated SFT policy (πSFT).

Qualitative Hardware Rollouts

Each pair shows a base VLA failure (left) and a language-steered VLA success (right) for one hardware task and perturbation condition.

Base VLA (failure) vs. Language Steered VLA (success)

Closed-Loop Recovery

Closed-loop steering provides robustness to adversarial mid-task perturbations. After the robot correctly places the green cube, a person moves it back into the scene. The base VLA continues from the original instruction and misplaces the cube, while πRFT observes the changed state, emits an updated language action ℓt (“Put the green block in the blue bin”), and steers the frozen VLA to recover — a behavior not seen with open-loop prompting. (Playback 2×.)

Base VLA vs. Language Steered VLA
Instruction (ℓtask): "Put the green cube in the blue basket and the red cube in the orange basket."

Simulation (LIBERO-OOD)

In simulation we steer a π0.5 VLA fine-tuned on LIBERO. We report mean success rate with standard error, pooled over 200 visual×semantic perturbation combinations.
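The page does not state how the standard error is computed. One common choice for a pooled success rate is the binomial standard error, sqrt(p(1 - p) / n), sketched below. The function name and the counts in the usage note are hypothetical, and the sketch assumes episodes are treated as independent Bernoulli trials; a per-combination average would give a different error bar.

```python
import math
from typing import Tuple


def success_rate_with_se(successes: int, episodes: int) -> Tuple[float, float]:
    """Return (success rate, binomial standard error), both in percent.

    Assumes every episode is an independent Bernoulli trial, which is an
    assumption of this sketch rather than something the page confirms.
    """
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    if not 0 <= successes <= episodes:
        raise ValueError("successes must lie between 0 and episodes")
    p = successes / episodes
    se = math.sqrt(p * (1.0 - p) / episodes)
    return 100.0 * p, 100.0 * se
```

For instance, a hypothetical 7,500 successes out of 10,000 episodes gives a rate of 75.0% with a standard error of roughly 0.43 percentage points.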

Baselines — training vs. deployment success rate

Each point is a method: x = language-policy training success, y = held-out deployment success (up & right is better). Bars are standard error. n = 2,500 (train) / 7,500 (deploy) episodes.
Each method is shown in the LFP training environment vs. held-out deployment environments. The narrated policy πSFT improves training success but degrades deployment success — narration alone is not enough for robustness. Interactive search and calibration recover it: πRFT and πCP outperform every baseline, while direct fine-tuning (πVLA-SFT) helps only in training-like conditions. Absolute success is higher in deployment because LIBERO-10 is near-saturated for the base VLA.

On-Policy Rollout Usage

Success rate vs. successful on-policy rollouts per task. n = 10,000 episodes.
A favorable exchange rate for language steering: πRFT trained on only 10 successful rollouts already matches or exceeds πVLA-SFT trained on 50, i.e., with one-fifth of the fine-tuning data. πVLA-SFT plateaus by 50 rollouts while πRFT keeps improving, since adapting the language interface uses data more efficiently than fine-tuning the action policy.

Novel Behavior Composition

Unseen compositions of learned behaviors (Compose). n = 300 episodes.
πRFT steers the frozen VLA to solve unseen Compose tasks (novel combinations of learned behaviors), whereas direct VLA fine-tuning slightly degrades over the base VLA.

Hardware (Franka Emika)

On hardware we steer a π0.5 VLA zero-shot (no fine-tuning) across four tabletop tasks and two novel tasks, each with visual (VOOD) and semantic (SOOD) perturbations.

CubeSort

Success rate (%) over 30 trials per condition.
πRFT improves over the base VLA on the steerable tasks (CubeSort, CubeMug, MarkerBlock). On the less-steerable Microwave, uncalibrated steering can hurt under perturbation, but πCP refuses harmful interventions and recovers performance above the base VLA. On the two novel tasks, πRFT transfers zero-shot, beating both the base VLA and πSFT.

When (Not) to Steer: Calibrated Refusal

Refusal is not only a safety mechanism — it also improves task performance by steering selectively. Conformal calibration controls the rate of harmful steering at a chosen target α.
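One standard way to control the false-positive rate at a target α is split conformal calibration applied only to the negative class: collect improvement-head scores on held-out rollouts where steering did not help, and set the steering threshold to a conservative order statistic of those scores. The sketch below illustrates that generic recipe; it is not confirmed to be the exact procedure used here. The names `conformal_steering_threshold`, `should_steer`, and `harmful_scores` are hypothetical, and it assumes higher scores mean steering is more likely to help.

```python
import math
from typing import Sequence


def conformal_steering_threshold(harmful_scores: Sequence[float], alpha: float) -> float:
    """Threshold tau such that P(score > tau | steering is harmful) <= alpha.

    harmful_scores: improvement-head scores on calibration rollouts where
        steering did NOT help (the negative class). Assumed exchangeable
        with future harmful cases.
    alpha: target false-positive rate, strictly between 0 and 1.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    n = len(harmful_scores)
    # Index of the conservative (1 - alpha) empirical quantile with the
    # finite-sample (n + 1) correction used in split conformal prediction.
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points to certify alpha: never steer.
        return math.inf
    return sorted(harmful_scores)[k - 1]


def should_steer(score: float, threshold: float) -> bool:
    """Steer only when the score strictly exceeds the calibrated threshold."""
    return score > threshold
```

With 99 harmful calibration scores and α = 0.10, k = ⌈100 × 0.9⌉ = 90, so the threshold is the 90th smallest harmful score. Calibrating on the negative class alone is what makes the guarantee class-conditional: it bounds harmful steering without constraining how often helpful steering is allowed.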

Refusal improves success while cutting harmful steering

Simulation, 10,000 pooled rollouts across 200 visual & semantic perturbation combinations.
Refusing to steer when language is harmful improves success. Without refusal, πNR reaches 70.93%; adding refusal (πRFT) raises it to 74.96%, and conformal calibration (πCP) to 75.96%. Calibration also cuts the false positive rate (harmful steering) from 38.92% to 9.31% at the α = 0.10 target.

Calibration tracks the target false-positive rate

Reliability diagram: empirical FPR vs. target α, averaged over 5 randomized calibration/test splits.
As we sweep the target α, the empirical FPR tracks the diagonal (FPR = α), confirming the class-conditional conformal guarantee that harmful steering is bounded by α. The uncalibrated policy instead sits far above the diagonal at a fixed 38.92% FPR.
Confusion matrix of πRFT's language interventions (calibration at α = 0.10)
Setting      Method   Accuracy ↑   TPR ↑     FPR ↓    TNR ↑    FNR ↓    Refusal Rate
Simulation   πRFT     66.22%       70.18%    38.92%   61.08%   29.82%   43.42%
Simulation   πCP      77.74%       67.77%    9.31%    90.69%   32.23%   57.66%
Hardware     πRFT     84.72%       100.00%   61.11%   38.89%   0.00%    9.72%
Hardware     πCP      99.17%       99.63%    2.22%    97.78%   0.37%    24.72%
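The columns above follow the usual confusion-matrix definitions, which can be sketched as below. The labeling convention is an assumption: a "positive" is a case where steering helps, a predicted positive is a case where the policy steers, and a refusal is a predicted negative. The function name and the counts in the usage note are hypothetical.

```python
from typing import Dict


def intervention_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Confusion-matrix rates for steer/refuse decisions.

    tp: steered, and steering helped      fp: steered, but steering was harmful
    tn: refused, and steering was harmful fn: refused, but steering would have helped
    Assumes both classes are present (tp + fn > 0 and fp + tn > 0).
    """
    positives = tp + fn   # cases where steering helps
    negatives = fp + tn   # cases where steering is harmful
    total = positives + negatives
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / positives,
        "fpr": fp / negatives,
        "tnr": tn / negatives,
        "fnr": fn / positives,
        "refusal_rate": (tn + fn) / total,  # fraction of decisions that refuse
    }
```

Under these definitions TNR = 1 - FPR and FNR = 1 - TPR, which the reported rows satisfy (for example, 38.92% + 61.08% = 100%). With hypothetical counts tp=40, fp=10, tn=30, fn=20, the function gives an accuracy of 0.70, an FPR of 0.25, and a refusal rate of 0.50.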
Search ablation (simulation): trajectory-level closed-loop search wins
Search                                 Success Rate (%)   Refusal (%)
Open-loop (static prompt edit)         71.3 ± 0.5         43.5
Closed-loop, phrase-level              70.4 ± 0.5         45.9
Closed-loop, trajectory-level (Ours)   75.0 ± 0.4         43.4

BibTeX

@article{jeong2026mostlyharmless,
  title   = {Learning What to Say to Your VLA: Mostly Harmless Vision-Language-Action Model Steering},
  author  = {Jeong, Hyun Joe and Swamy, Gokul and Bajcsy, Andrea},
  journal = {Under review},
  year    = {2026}
}