TRACE: A Self-Improving Framework for Robot Behavior Forecasting with Vision-Language Models

1 University of Illinois Urbana-Champaign 2 Florida International University 3 Lucid Group, Inc.
* Equal contribution

Abstract

Predicting the near-term behavior of a reactive agent is crucial in many robotic scenarios, yet remains challenging when observations of that agent are sparse or intermittent. Vision-Language Models (VLMs) offer a promising avenue by integrating textual domain knowledge with visual cues, but their one-shot predictions often miss important edge cases and unusual maneuvers. Our key insight is that iterative, counterfactual exploration--where a dedicated module probes each proposed behavior hypothesis, explicitly represented as a plausible trajectory, for overlooked possibilities--can significantly enhance VLM-based behavioral forecasting. We present TRACE (Tree-of-thought Reasoning And Counterfactual Exploration), an inference framework that couples tree-of-thought generation with domain-aware feedback to refine behavior hypotheses over multiple rounds. Concretely, a VLM first proposes candidate trajectories for the agent; a counterfactual critic then suggests edge-case variations consistent with partial observations, prompting the VLM to expand or adjust its hypotheses in the next iteration. This creates a self-improving cycle where the VLM progressively internalizes edge cases from previous rounds, systematically uncovering not only typical behaviors but also rare or borderline maneuvers, ultimately yielding more robust trajectory predictions from minimal sensor data. We validate TRACE on both ground-vehicle simulations and real-world marine autonomous surface vehicles. Experimental results show that our method consistently outperforms standard VLM-driven and purely model-based baselines, capturing a broader range of feasible agent behaviors despite sparse sensing.

Methodology

TRACE operates through an iterative three-component cycle: (i) Hypothesis Generation, where a VLM analyzes sparse observations to propose initial behavior hypotheses; (ii) Counterfactual Exploration, where a critic identifies overlooked edge cases; and (iii) Self-Improvement, where both valid and rejected hypotheses are integrated into the VLM context to enhance the next round of predictions.
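The cycle above can be sketched as a simple loop. This is an illustrative sketch only: the function names (`propose_hypotheses`, `find_edge_cases`, `is_feasible`) and the context format are assumptions, not the paper's actual interfaces.

```python
# Illustrative sketch of the TRACE iteration loop (hypothetical interfaces).

def trace_cycle(observations, vlm, critic, world_model, n_rounds=5):
    context = []     # accumulated valid and rejected hypotheses across rounds
    hypotheses = []
    for _ in range(n_rounds):
        # (i) Hypothesis Generation: the VLM proposes candidate trajectories,
        #     conditioned on sparse observations and prior-round feedback.
        hypotheses = vlm.propose_hypotheses(observations, context)

        # (ii) Counterfactual Exploration: a critic suggests edge-case
        #      variations still consistent with the partial observations.
        edge_cases = critic.find_edge_cases(hypotheses, observations)

        # (iii) Self-Improvement: a world model separates feasible from
        #       infeasible candidates; both sets are fed back as context.
        candidates = hypotheses + edge_cases
        valid = [h for h in candidates if world_model.is_feasible(h)]
        rejected = [h for h in candidates if not world_model.is_feasible(h)]
        context.append({"valid": valid, "rejected": rejected})
    return hypotheses
```

The key design point is that rejected hypotheses are retained in the context rather than discarded, so later rounds can avoid repeating the same infeasible proposals.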

Architecture Diagram

Experimental Videos

We evaluate TRACE in scenarios involving two agents, an observer and a target, operating in a shared environment. Due to real-world constraints (such as limited sensor range or communication bandwidth restrictions), the observer receives only sporadic measurements of the target agent's state. Our goal is to predict the target's behavior hypotheses--trajectories--despite this measurement sparsity, enabling effective decision-making by the observer. We present three example tasks demonstrating our approach in action. Each video highlights a distinct scenario, showcasing both the baseline predictions and the refined outcomes after iterative adjustments.

Task 1

Task 2

Task 3

Results

Our experiments reveal four key insights: enhanced hypothesis coverage, discovery of rare maneuvers, self-improving VLM outputs, and reduced invalid trajectories.

Key Finding 1

Counterfactual exploration expands the behavioral hypothesis space. TRACE significantly outperforms baseline methods, achieving 82.9–93.1% coverage compared to 50.7–64.2% for the next-best approach (GIoT). The table below reports coverage ratios (higher is better) across tasks T1–T5.

Method T1 T2 T3 T4 T5
CoT 0.2% 2.1% 1.8% 2.6% 3.8%
GIoT 58.3% 62.7% 50.7% 64.2% 59.5%
ToT 47.1% 51.4% 40.4% 63.2% 59.0%
TRACE (Ours) 84.6% 87.3% 82.9% 93.1% 90.4%
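One plausible reading of the coverage ratio, assumed here for illustration (the paper's exact definition may differ), counts a ground-truth behavior as covered when at least one predicted trajectory stays within a distance tolerance of it:

```python
import numpy as np

def coverage_ratio(predicted, ground_truth, tol=1.0):
    """Fraction of ground-truth trajectories matched by at least one
    prediction, where a match means the mean pointwise Euclidean error
    is below tol. Trajectories are arrays of shape (T, 2).
    This is an assumed formulation, not the paper's exact metric."""
    covered = 0
    for gt in ground_truth:
        gt = np.asarray(gt, dtype=float)
        if any(np.mean(np.linalg.norm(np.asarray(p, dtype=float) - gt, axis=-1)) < tol
               for p in predicted):
            covered += 1
    return covered / len(ground_truth)
```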

Key Finding 2

Iterative tree-of-thought expansion reveals rare but critical maneuvers. By exploring creative, edge-case trajectories, TRACE uncovers unconventional yet valid maneuvers, such as unexpected passing strategies and nuanced right-turn alternatives, that baseline methods consistently miss.

Key Finding 3

The VLM's outputs self-improve through iterative counterfactual exposure. As measurement updates accumulate, the VLM learns from counterfactual feedback, increasing its diversity of valid trajectory proposals by 31.8% by the fifth update.
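Proposal diversity can be quantified in several ways; one simple proxy, used here purely for illustration (the paper's measure may differ), is the mean pairwise distance between trajectory endpoints:

```python
import itertools
import numpy as np

def proposal_diversity(trajectories):
    """Mean pairwise Euclidean distance between trajectory endpoints.
    A simple diversity proxy for a set of valid proposals; this is an
    illustrative assumption, not the paper's exact measure."""
    ends = [np.asarray(t, dtype=float)[-1] for t in trajectories]
    pairs = list(itertools.combinations(ends, 2))
    if not pairs:
        return 0.0  # fewer than two proposals: no pairwise spread
    return float(np.mean([np.linalg.norm(a - b) for a, b in pairs]))
```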

Graph showing increase in trajectory diversity

Key Finding 4

Iterative world model feedback teaches VLMs to reduce invalid trajectories. By incorporating negative feedback from the world model, TRACE substantially lowers the rate of invalid predictions, especially in complex maritime scenarios. The table below reports invalid-prediction rates (%) for tasks T1–T5 at successive measurement updates M1–M5.

Update T1 T2 T3 T4 T5
M1 24.8% 26.3% 25.7% 22.4% 23.9%
M2 19.5% 21.7% 22.3% 18.1% 19.2%
M3 14.2% 15.6% 17.8% 13.2% 14.7%
M4 9.8% 11.3% 12.6% 8.7% 10.1%
M5 7.2% 8.5% 9.4% 6.8% 7.9%
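The negative-feedback mechanism can be illustrated with a toy world-model check. The speed-limit criterion and the textual feedback format below are hypothetical stand-ins; the actual world model and prompt format are the authors' own.

```python
import numpy as np

def feasibility_feedback(trajectories, max_speed=5.0, dt=1.0):
    """Toy world-model check: flag trajectories whose step-to-step speed
    exceeds a limit, and return textual feedback suitable for appending
    to the next VLM prompt. Illustrative only -- the real world model
    checks richer constraints (e.g., collisions, kinematics)."""
    feedback = []
    for i, traj in enumerate(trajectories):
        steps = np.diff(np.asarray(traj, dtype=float), axis=0)
        speeds = np.linalg.norm(steps, axis=-1) / dt
        if np.any(speeds > max_speed):
            feedback.append(
                f"trajectory {i}: infeasible, peak speed {speeds.max():.1f} "
                f"exceeds limit {max_speed}"
            )
    return feedback
```

Feeding such rejection messages back into the prompt is what drives the declining invalid rates from M1 to M5 in the table above.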

Supplementary Video

Supplementary video.

Citation

Please use the following BibTeX entry to cite our work:

@misc{trace2025,
  title = {TRACE: A Self-Improving Framework for Robot Behavior Forecasting with Vision-Language Models},
  author = {Gokul Puthumanaillam and Paulo Padrao and Jose Fuentes and Pranay Thangeda and William E. Schafer and Jae Hyuk Song and Karan Jagdale and Leonardo Bobadilla and Melkior Ornik},
  year = {2025},
  eprint = {2503.00761},
  archivePrefix = {arXiv},
  primaryClass = {cs.RO},
  url = {https://arxiv.org/abs/2503.00761}
}