AutoSpec
Automating the Refinement of Reinforcement Learning Specifications
Tanmay Ambadkar, Đorđe Žikelić, Abhinav Verma (Accepted at ICLR 2026)
A novel framework that automatically transforms "coarse" or under-specified logical objectives into refined, guidance-rich specifications, enabling RL agents to master complex tasks where standard methods fail.
Why do concise specifications fail?
Reinforcement Learning (RL) has achieved remarkable feats, but specifying what an agent should do is challenging. Manually designing scalar reward functions is an art form, and slight flaws can lead to poor behavior.
Logical specifications (like "Reach Goal while Avoiding Obstacles") offer a promising, interpretable alternative. However, humans tend to write "coarse" specifications. For example, "Reach the Kitchen" is a valid goal, but if the kitchen is down a winding hallway with trap states (like a staircase), a standard RL agent will struggle to discover the path using only the sparse feedback from the coarse specification.
Trap States: Coarse regions may overlap with unrecoverable states.
Lack of Waypoints: Long-horizon tasks are difficult without intermediate sub-goals.
Overly Broad Goals: Large target regions dilute the learning signal.
Specification Refinement Problem
Given an initial specification φ, we search for a refined specification φ′ such that:
"Satisfaction of the refined spec φ′ guarantees satisfaction of the original φ, but φ′ is easier to learn."
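The soundness condition above can be sketched in code. As an illustration (not the paper's formalism), treat a specification as a pair of predicate sets over states: refinement is sound when the refined goal region is contained in the original's and the refined avoid region covers the original's.

```python
# Illustrative sketch: a specification as a reach set and an avoid set
# over discrete states. "refines" checks the soundness condition: any
# trajectory satisfying the refined spec also satisfies the original.

def refines(original, refined):
    """Refined spec is sound w.r.t. the original if its reach region is
    a subset of the original's and its avoid region is a superset."""
    return (refined["reach"] <= original["reach"] and
            refined["avoid"] >= original["avoid"])

original = {"reach": {(4, 4), (4, 5), (5, 4), (5, 5)}, "avoid": {(2, 2)}}
refined  = {"reach": {(4, 4), (4, 5)}, "avoid": {(2, 2), (3, 3)}}

print(refines(original, refined))  # True: tighter goal, larger avoid set
```

A tighter goal and a larger avoid set make the task more constrained, never more permissive, so every satisfying trajectory of the refined spec still satisfies the original.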
How AutoSpec Works
AutoSpec acts as a wrapper around specification-guided RL algorithms. It monitors the learning process to identify why a policy fails and autonomously refines the specification graph.
Monitor
Track success rates of edge policies in the abstract specification graph.
Diagnose
Collect failure and success trajectories when policies underperform.
Refine
Apply targeted refinement strategies (SeqRefine, AddRefine, PastRefine, OrRefine).
Re-train
Train the policy on the new, easier-to-learn specification.
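The four steps above form a loop, which can be sketched as follows. Everything here (the trainer stub, the success threshold, the `refinements` counter) is an illustrative stand-in for the paper's actual interfaces, not AutoSpec's implementation.

```python
# Minimal sketch of the monitor/diagnose/refine/re-train loop as a wrapper
# around an edge-policy trainer. All names and numbers are illustrative.

def train_edge(spec):
    # Stand-in trainer: success probability improves as the spec is refined.
    return min(1.0, 0.3 + 0.35 * spec["refinements"])

def autospec_loop(spec, target=0.9, max_rounds=5):
    success = 0.0
    for _ in range(max_rounds):
        success = train_edge(spec)      # Re-train on the current spec
        if success >= target:           # Monitor: edge policy good enough?
            break
        spec["refinements"] += 1        # Diagnose + Refine (stubbed out)
    return spec, success

spec, success = autospec_loop({"refinements": 0})
print(round(success, 2))  # 1.0 after two refinement rounds
```

The real system would replace `train_edge` with a specification-guided RL algorithm and the counter increment with one of the four refinement procedures below.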
The Four Pillars of Refinement
AutoSpec employs four targeted procedures to address specific failure modes while maintaining logical soundness.
SeqRefine: Refining Predicates
Problem: Target region is too broad or contains trap states.
Solution: Automatically tightens the bounds of target (reach) regions and safety (avoid) constraints using convex hulls of successful exploration traces. This effectively shrinks the target to exclude "unreachable" or dangerous areas.
- Removes "unreachable" parts of goal regions.
- Identifies and excludes trap states.
(Trap State Elimination)
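The tightening idea can be sketched in a few lines. The paper fits convex hulls of successful traces; to keep this example dependency-free, an axis-aligned bounding box stands in for the hull, and the data is illustrative.

```python
# Simplified sketch of SeqRefine's idea: shrink a broad goal region to the
# states actually visited on successful rollouts. A bounding box stands in
# for the paper's convex hulls; all data here is illustrative.

def tighten_region(success_states):
    xs = [s[0] for s in success_states]
    ys = [s[1] for s in success_states]
    return (min(xs), max(xs)), (min(ys), max(ys))

# Goal states reached on successful rollouts; the trap corner of the
# original broad goal never appears here, so it gets excluded.
reached = [(4.0, 4.2), (4.5, 4.8), (4.1, 4.6)]
(x_lo, x_hi), (y_lo, y_hi) = tighten_region(reached)
print((x_lo, x_hi), (y_lo, y_hi))  # (4.0, 4.5) (4.2, 4.8)
```

Any state inside the tightened box was demonstrably reachable, so restricting the goal to it cannot make a satisfiable task unsatisfiable.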
AddRefine: Adding Waypoints
Problem: Path is too long for a single policy to learn reliably.
Solution: Decomposes long-horizon tasks by identifying stable "midpoints" in successful trajectories. It splits a single long edge into two consecutive edges through the new waypoint, creating two shorter, more manageable sub-tasks.
- Breaks complex paths into learnable segments.
- Reduces the effective horizon for the RL agent.
(Waypoint Introduction)
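A minimal sketch of the splitting step, assuming trajectories are lists of 2-D states and the specification graph is an adjacency dict (both illustrative choices, not the paper's data structures):

```python
# Sketch of AddRefine: pick a waypoint near the middle of successful
# trajectories and split one long edge into two shorter sub-tasks.

def pick_waypoint(trajectories):
    mids = [traj[len(traj) // 2] for traj in trajectories]
    n = len(mids)
    return (sum(p[0] for p in mids) / n, sum(p[1] for p in mids) / n)

def split_edge(graph, u, v, waypoint):
    graph[u].remove(v)          # drop the long edge u -> v
    w = ("wp", waypoint)
    graph[u].append(w)          # add u -> waypoint
    graph[w] = [v]              # add waypoint -> v
    return graph

trajs = [[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)],
         [(0, 0), (1, 2), (2, 2), (3, 2), (4, 4)]]
wp = pick_waypoint(trajs)       # average of mid-trajectory states
graph = {"start": ["goal"], "goal": []}
graph = split_edge(graph, "start", "goal", wp)
print(wp)  # (2.0, 2.0)
```

Each of the two new edges has roughly half the original horizon, so the RL agent gets denser feedback on both sub-tasks.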
PastRefine: Source Partitioning
Problem: Some start states in a region are doomed to fail due to dynamics or obstacles.
Solution: Learns a separating hyperplane (via SVM) between initial states that lead to success and those that fail. It creates a new node for the "good" starts, focusing learning only where success is possible.
- Focuses learning on viable starting conditions.
- Improves reliability in stochastic environments.
(Source Partitioning)
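The separation step can be sketched as below. The paper fits an SVM; a tiny perceptron stands in here to keep the example dependency-free, and the start states and labels are illustrative.

```python
# Sketch of PastRefine: learn a linear separator between initial states
# that led to success (+1) and those that failed (-1), then keep only the
# "good" side as the refined source region.

def perceptron(points, labels, epochs=50, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x, y), label in zip(points, labels):
            pred = 1 if w[0] * x + w[1] * y + b > 0 else -1
            if pred != label:                # misclassified: nudge the plane
                w[0] += lr * label * x
                w[1] += lr * label * y
                b += lr * label
    return w, b

# Starts left of the obstacle fail; starts right of it succeed.
starts = [(0.5, 1.0), (1.0, 2.0), (3.0, 1.0), (3.5, 2.5)]
labels = [-1, -1, 1, 1]
w, b = perceptron(starts, labels)
good = [p for p, l in zip(starts, labels)
        if w[0] * p[0] + w[1] * p[1] + b > 0]
print(good)
```

Training then proceeds only from the `good` partition, so the policy is not penalized for start states where success was never possible.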
OrRefine: Alternative Paths
Problem: The direct path is blocked or infeasible.
Solution: Discovers blocked paths and automatically wires new edges to alternative parent nodes in the specification graph. This enables the agent to backtrack or take entirely different routes (e.g., Path B instead of Path A).
- Enables dynamic routing around obstacles.
- Handles complex topology changes.
(Alternative Path Discovery)
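Rerouting can be sketched as a search over the specification graph that avoids edges diagnosed as blocked. The graph, node names, and blocked set below are illustrative, not the paper's benchmark.

```python
from collections import deque

# Sketch of OrRefine: when the direct edge is found blocked, search the
# specification graph for an alternative route to the goal.

def find_path(graph, src, dst, blocked=frozenset()):
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], []):
            if (path[-1], nxt) not in blocked and nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no viable route remains

graph = {"start": ["hall", "garden"], "hall": ["kitchen"],
         "garden": ["kitchen"], "kitchen": []}
blocked = {("start", "hall")}        # direct route diagnosed infeasible
print(find_path(graph, "start", "kitchen", blocked))
```

In AutoSpec the discovered detour would be wired into the specification graph as new edges, so the agent's objective (reach the kitchen) is preserved while the route changes.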
Scalability in Randomized Environments
We evaluated AutoSpec on a challenging "100-Rooms" domain where wall configurations and predicate locations were fully randomized for each seed.
The "Bridge" Bottleneck
In 80% of random seeds, agents got stuck at narrow passages ("bridges") between key regions. Standard methods (like DiRL) often plateau at 20% success rates due to these bottlenecks. AutoSpec autonomously identifies them and deploys targeted refinements (mostly AddRefine and SeqRefine) to boost success rates to over 90%.
(Success Probability Comparison Curve)
Demonstrated Impact
Over 90% success achieved on "Bridge" bottlenecks in randomized 100-Rooms environments (vs. < 20% baseline).
Improved task completion for high-dimensional robotic manipulation (PandaGym).
No manual reward engineering or hand-crafted heuristics required.