Tanmay Ambadkar

About me

I am a PhD student in Computer Science at Penn State, advised by Dr. Abhinav Verma. My research focuses on building trustworthy AI by integrating reinforcement learning with the safety and verification principles of programming languages. I develop novel frameworks that enable agents to learn from imperfect human guidance and operate within critical safety constraints, paving the way for their use in complex, high-stakes environments.

Research Overview

My research is focused on making artificial intelligence more reliable and trustworthy. While reinforcement learning (RL) can train agents to perform incredibly complex tasks, it often struggles with three key challenges: it needs perfectly defined goals, it can behave unsafely, and it does not know how to handle conflicting objectives. My work tackles these problems by creating frameworks that allow people to guide AI with high-level instructions. I am building systems that can automatically fix imperfect instructions, a "safety shield" that prevents agents from taking dangerous actions, and methods that allow users to balance competing goals on the fly. Ultimately, the goal is to build AI that is not only powerful but also safe, interpretable, and collaborative enough to be deployed in critical real-world scenarios.

Objective 1: Making RL Robust to Imperfect Instructions

The Problem: Manually designing a perfect reward function to guide an RL agent is extremely difficult and a major barrier for non-experts. A slightly flawed specification can lead to completely wrong or unexpected behavior.

My Approach (AutoSpec): I developed a framework called AutoSpec that allows a user to provide an initial, high-level, and potentially imperfect specification. The agent then autonomously refines and corrects this specification during training by identifying and resolving inconsistencies, leading to better task performance without requiring constant human intervention. This makes RL more accessible to domain experts who need the technology but are not RL specialists.

Objective 2: Building a Scalable Safety Shield for RL Agents

The Problem: An RL agent trained to maximize a reward will do so at all costs, potentially violating critical safety constraints. Furthermore, standard safe RL algorithms (like CMDPs) often fail to learn when given only sparse, binary cost signals (e.g., "safe" or "unsafe").

My Approach (SPARKD): I developed SPARKD, a scalable framework that learns a globally linear model of the environment's complex non-linear dynamics using Deep Koopman Operators. This "lifted" representation allows the shield to use formal methods, specifically weakest precondition calculus, to efficiently analyze the safety of an agent's actions over a finite horizon. If a proposed action could lead to an unsafe state, the shield intervenes with a safe alternative.

My Approach (RAMPS): Building on this, RAMPS (Robust Adaptive Multi-Step Predictive Shielding) advances the state-of-the-art by introducing a robust, multi-step Control Barrier Function (CBF). This method not only uses a learned linear model but also explicitly accounts for model error and control delays. This allows RAMPS to provide stronger, continuous safety guarantees and has demonstrated over a 90% reduction in safety violations in complex, high-dimensional robotic simulations.

Next Steps (AutoCost): To solve the sparse cost problem, I am developing AutoCost, which extracts a rich, continuous cost signal directly from these safety shields. When an agent's proposed action is unsafe, the safety Quadratic Program (QP) in RAMPS or SPARKD requires a large "slack" value to find a safe alternative. AutoCost captures the magnitude of this slack variable, which directly quantifies *how unsafe* the proposed action was. This provides a dense, informative cost gradient, enabling non-shielded CMDP agents to learn safe policies far more effectively than with traditional binary costs.

Objective 3: Enabling User-Driven Trade-offs in Multi-Objective RL

The Problem: Real-world applications rarely have a single goal. More often, they involve balancing a set of conflicting objectives, such as maximizing performance while minimizing cost and energy consumption.

My Approach (D³PO): Decomposed, Diversity-Driven Policy Optimization My work on the DoD-funded MIXTAPE initiative addresses this with D³PO. This framework trains a single, robust policy capable of generating a wide spectrum of optimal behaviors. This allows a human operator to interactively tune the agent's objective priorities at runtime (without retraining) to adapt its strategy to changing mission requirements.

Next Steps:

Negative Preferences (Symmetric Scalarization): We are introducing a fundamentally new paradigm where negative preferences can be allowed, enabling an operator to actively penalize an objective. We will explore its impact by designing new environments and metrics and modifying D³PO to support this symmetric instruction space.
Non-Linear Scalarization: We are moving from linear combinations of objectives to non-linear scalarization functions. This will allow the agent to learn a richer class of non-convex trade-offs, giving the user a choice between multiple possible scalarization options.

Publications

Specification Guided Reinforcement Learning

Tanmay Ambadkar

Accepted at AAAI Doctoral Consortium

Abstract

While Reinforcement Learning (RL) has demonstrated remarkable success in solving complex sequential decision-making problems, its application in real-world, safety-critical systems is hindered by its reliance on carefully engineered reward functions. Designing effective rewards is notoriously challenging and can lead to unintended or unsafe behaviors, a phenomenon known as reward hacking. Specification-guided RL has emerged as a principled alternative, leveraging formal methods to directly encode high-level objectives, safety requirements, and behavioral constraints. However, the practical utility of this approach is often limited by coarse or under-specified logical formulas and the computational challenge of enforcing safety at scale. This thesis addresses these limitations by developing a unified framework for the automated refinement, scalable enforcement, and flexible adaptation of formal specifications in RL.

Robust Adaptive Multi-Step Predictive Shielding

Tanmay Ambadkar, Darshan Chudiwal, Greg Anderson, Abhinav Verma

Accepted at AAAI Student Abstract and Poster Program, In Submission at ICLR 2026

Paper | Homepage

Abstract

Reinforcement learning for safety-critical tasks requires policies that are both high-performing and safe throughout the learning process. While model-predictive shielding is a promising approach, existing methods are often computationally intractable for the high-dimensional, nonlinear systems where deep RL excels, as they typically rely on a patchwork of local models. We introduce RAMPS, a scalable shielding framework that overcomes this limitation by leveraging a learned, linear representation of the environment's dynamics. This model can range from a linear regression in the original state space to a more complex operator learned in a high-dimensional feature space. The key is that this linear structure enables a robust, look-ahead safety technique based on a multi-step Control Barrier Function (CBF). By moving beyond myopic one-step formulations, RAMPS accounts for model error and control delays to provide reliable, real-time interventions. The resulting framework is minimally invasive, computationally efficient, and built upon robust control-theoretic foundations. Our experiments demonstrate that RAMPS significantly reduces safety violations compared to existing safe RL methods while maintaining high task performance in complex control environments.

Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization

Tanmay Ambadkar, Sourav Panda, Shreyas Kale, Abhinav Verma, Jonathan Dodge

In Submission at ICLR 2026

Paper | Homepage

Abstract

Multi-objective reinforcement learning (MORL) aims to optimize policies in environments with multiple, often conflicting objectives. While a single, preference-conditioned policy offers the most flexible and efficient solution, existing methods often struggle to cover the entire spectrum of optimal trade-offs. This is frequently due to two underlying challenges: destructive gradient interference between conflicting objectives and representational mode collapse, where the policy fails to produce diverse behaviors. In this work, we introduce D3PO, a novel algorithm that trains a single preference conditioned policy to directly address these issues. Our framework features a decomposed optimization process to encourage stable credit assignment and a scaled diversity regularizer to explicitly encourage a robust mapping from preferences to policies. Empirical evaluations across standard MORL benchmarks show that D3PO discovers more comprehensive and higher-quality Pareto fronts, establishing a new state-of-the-art in terms of hypervolume and expected utility, particularly in complex and many-objective environments.

Safer Policies via Affine Representations using Koopman Dynamics

Tanmay Ambadkar, Darshan Chudiwal, Greg Anderson, Abhinav Verma

In Submission at AAAI 2026

Abstract

Reinforcement learning for safety-critical tasks requires constructing a policy that prioritizes taking safe actions while optimizing performance. Moreover, in many applications it is important to maintain safety during training, not just at the end of the learning process. Model-predictive shielding using weakest preconditions is a promising framework that helps maintain safety during training and deployment, but current techniques are limited to low-dimensional state spaces (4 features) and struggle to scale to higher dimensional environments with complex environment dynamics. In this paper we present SPARKD, a highly scalable framework for model-predictive shielding which works by linearizing non-linear dynamics using a lifted representation of the state space. Our framework leverages the Koopman Operator theory to learn basis functions which augument the state space to capture highly non-linear transition models. The use of the lifted space by SPARKD allows for effective safety calculation using convex optimization and efficient safety analysis for shielding. Our experiments show that SPARKD is capable of learning performant policies with fewer safety violations than existing safe RL techniques.

AutoSpec: Automating the Refinement of Reinforcement Learning Specifications

Tanmay Ambadkar, Đorđe Žikelić, Abhinav Verma

Accepted at The Workshop on Post-AI Formal Methods at AAAI-26; Accepted at PLDI SRC 2024, In Submission at ICLR 2026

Paper | Poster

Abstract

Logical specifications have been shown to help reinforcement learning algorithms in achieving complex tasks. However, when a task is under-specified, agents might fail to learn useful policies. In this work, we explore the possibility of improving coarse-grained logical specifications via an exploration-guided strategy. We propose AutoSpec, a framework that searches for a logical specification refinement whose satisfaction implies satisfaction of the original specification, but which provides additional guidance therefore making it easier for reinforcement learning algorithms to learn useful policies. AutoSpec is applicable to reinforcement learning tasks specified via the SpectRL specification logic. We exploit the compositional nature of specifications written in SpectRL, and design four refinement procedures that modify the abstract graph of the specification by either refining its existing edge specifications or by introducing new edge specifications. We prove that all four procedures maintain specification soundness, i.e. any trajectory satisfying the refined specification also satisfies the original. We then show how AutoSpec can be integrated with existing reinforcement learning algorithms for learning policies from logical specifications. Our experiments demonstrate that AutoSpec yields promising improvements in terms of the complexity of control tasks that can be solved, when refined logical specifications produced by AutoSpec are utilized.

Scaling Strategy, Not Compute: A Stand-Alone, Open-Source StarCraft II Benchmark for Accessible RL Research

Sourav Panda, Tanmay Ambadkar, Shreyas Kale, Abhinav Verma, Jonathan Dodge

In Submission

Abstract

The research community lacks a middle ground between StarCraft II’s full game and mini-games. In the full game, sprawling state-action space renders reward signals sparse and noisy, but in mini-games, simple agents saturate performance with little genuine strategy. This yawning complexity gap hinders steady curriculum design and prevents many academic groups from experimenting with modern RL algorithms under realistic compute budgets. To fill this gap, we present the Two-Bridge Map, the first entry in an open-source benchmark series we purposely engineered as an intermediate benchmark to sit between these extremes. By disabling economy mechanics such as resource collection, base building, and fog-of-war, the environment isolates two core wargaming skills: long-range navigation and micro-combat. Two-Bridge ships as a lightweight, Gym-compatible wrapper on top of PySC2. Because it is a directly interactable environment rather than a replay-driven pipeline, researchers can train and evaluate any RL algorithm immediately---no downloading, filtering, or preprocessing of Blizzard replays required. Preliminary experiments show that agents learn coherent manoeuvring and engagement behaviours without imposing full-game computational costs. By open-sourcing the maps, wrappers, and reference scripts, we invite researchers to adopt the Two-Bridge Map Series as a standard benchmark.

MIXTAPE: Middleware for Interactive XAI with Tree-Based AI Performance Evaluation

Tanmay Ambadkar, Hayden Moore, Sourav Panda, Shreyash Kale, Connor Greenwell, Brianna Major, Aashish Chaudhary, Jonathan Dodge, Abhinav Verma and Brian Hu

Simulation Interoperability Standards Organization (SISO) SIMposium, 2025

Abstract Link

MIXTAPE: Middleware for Interactive XAI with Tree-Based AI Performance Evaluation

Brian Hu, Jonathan Dodge, Abhinav Verma, Tanmay Ambadkar, Sourav Panda, Sujay Koujalgi, Aashish Chaudhary, Brianna Major, and Bryon Lewis

Simulation Interoperability Standards Organization (SISO) SIMposium, 2024

Abstract Link

Optimizing Operational Costs in Combined Heat and Power Integrated District Heating Systems: A Reinforcement Learning Approach

Saranya Anbarasu, Tanmay Ambadkar, Rosina Adhikari, Kathryn Hinkelman, Zhanwei He, Wangda Zuo, Ardeshir Moftakhari

SimBuild, 2024

Paper

Abstract

As societies worldwide strive to reduce carbon footprints and transition toward cleaner energy sources, grid-integrated district energy systems (DES) emerge as a pivotal player in achieving these objectives. The escalating complexity of DES necessitates adaptive, synergistic, and hierarchical control of heterogeneous systems to achieve common energy and cost conservation goals. Prior research highlights several challenges of model-based control techniques for DES, such as limited access to computational tools, prolonged durations to digital twin development, and the complexities associated with control design. In contrast, model-free control methodologies appear as a viable alternative. As a response, our study explores a reinforcement learning-based (RL) supervisory control to minimize the operational costs in a university campus DES. To enhance overall system efficiency, we utilize resource flexibility to improve DES operations by responding to fluctuations in utility prices. In this paper, we demonstrate the toolchain, and virtual testbed development, engineer a suitable RL reward, along with the learning from challenges. From the case study, the RL agent showcases a significant 32% net operational cost savings and a 13% peak demand reduction compared to the conventional thermal load following control. This research signifies the potential of RL-based control systems in optimizing the performance of complex DES and multi-energy systems involving several control points.

A Simple Fast Resource-efficient Deep Learning for Automatic Image Colorization

Tanmay Ambadkar, Jignesh S. Bhatt

Color and Imaging Conference (CIC), 2023

Paper Code

Abstract

Colorization of grayscale images is a severely ill-posed inverse problem among computer vision tasks. We present a novel end-to-end deep learning method for the automatic colorization of grayscale images. Past methods employ multiple deep networks, use auxiliary information, and/or are trained on massive datasets to understand the semantic transfer of colors. The proposed method is a 38-layer deep convolutional residual network that utilizes the CIELAB color space to reduce the problem’s solution space. The network comprises 16 residual blocks, each with 128 convolutional filters to address the ill-posedness of colorization, followed by 4 convolutional blocks to reconstruct the image. Experiments under challenging heterogeneous scenarios and using the Imagenet, Intel, and MirFlickr datasets show significant generalization when assessed visually and against PSNR, SSIM, and PIQE. The proposed method is relatively simpler (16 million parameters), faster (15 images/sec), and resource-efficient (just 50000 training images) when compared to the state-of-the-art

Discrete Sequencing for Demand Forecasting: A novel data sampling technique for time series forecasting

N. Menon, S. Saboo, T. Ambadkar and U. Uppili

International Conference on Intelligent Data Science Technologies and Applications (IDSTA), 2022

Paper

Abstract

Accurately forecasting energy consumption for buildings has become increasingly important over the years owing to the increasing prices of energy. A good forecast gives an understanding of how much the expected load (demand) of the building would be in the coming days and months. This could be used in further planning of energy usage within the building. This also becomes important due to the dynamic nature of energy rates. With an accurate forecast, one could also aim for spot trading by which the energy is bought and sold at different rates in a daily fashion. We target short-term and medium-term demand forecasting for buildings. Data Sampling is an integral part of training time-series models. The temporal horizon along with the patterns captured contribute to the model learning and thus its forecasts. When the data is aplenty with more than one value per day, the traditional sliding window method is unable to forecast for short-term forecasts without the actual truth values because of its continuous nature. The forecasts deviate very quickly and become unusable. In this paper, we present a novel data sampling technique called Discrete Sequencing. This samples data sequences in a lagged fashion which looks at a much larger temporal horizon with a smaller sequence size. We demonstrate the efficacy of our sampling technique by testing the forecasts on three different neural network architectures.

Deep reinforcement learning approach to predict head movement in 360° videos

Tanmay Ambadkar, Pramit Mazumdar

Proc. IS&T Int’l. Symp. on Electronic Imaging: Image Processing: Algorithms and Systems, 2022

Paper Code

Abstract

The popularity of 360° videos has grown immensely in the last few years. One probable reason is the availability of low-cost devices and ease in capturing them. Additionally, users have shown interest in this particular type of media due to its inherent feature of being immersive, which is completely absent in traditional 2D videos. Nowadays such powerful 360° videos have many applications such as generating various content-specific videos (gaming, knowledge, travel, sports, educational, etc.), during surgeries by medical professionals, in autonomous vehicles, etc. A typical 360° video when seen through a Head Mounted Display (HMD) gives an immersive feeling, where the viewer perceives standing within the real environment in a virtual platform. Similar to real life, at any point in time, the viewer can view only a particular region and not the entire 360° content. Viewers adopts physical movement for exploring the total 360° content. However, due to the large volume of 360° media, it faces challenges during transmission. Adaptive compression techniques have been incorporated in this regard, which is in accordance with the viewing behaviour of a viewer. Therefore, with the growing popularity and usage of 360° media, the adaptive compression methodologies are in development. One important factor in adaptive compression is the estimation of the natural field-of-view (FOV) of a viewer watching 360° content using a HMD. The FOV estimation task becomes more challenging due to the spatial displacement of the viewer with respect to the dynamically changing video content. In this work, we propose a model to estimate the FOV of a user viewing a 360° video using an HMD. This task is popularly known as the Virtual Cinematography. The proposed FOVSelectionNet is primarily based on a reinforcement learning framework. In addition to this, saliency estimation is proved to be a very powerful indicator for attention modelling. Therefore, in this proposed network we utilise a saliency indicator for driving the reward function of the reinforcement learning framework. Experiments are performed on the benchmark Pano2Vid 360° dataset, and the results are observed to be similar to human exploration

Education

The Pennsylvania State University, University Park, PA, USA

Ph.D in Computer Science and Engineering

January 2024 - May 2027

GPA: 3.56

The Pennsylvania State University, University Park, PA, USA

M.S. in Computer Science and Engineering

August 2022 - December 2023

GPA: 3.9

Indian Institute of Information Technology, Vadodara, Gujarat, India

B.Tech in Computer Science and Engineering

August 2022 - May 2024

GPA: 9.4

Work Experience

Dept. of Architectural Engineering, The Pennsylvania State University

Research Assistant

August 2023 - Jan 2025

I worked on integrating Dymola simulation tools with Gymnasium, defining a robust multi-objective reward system and training multiple RL agents.

Dept. of Industrial and Manufacturing Engineering, The Pennsylvania State University

Research Assistant

May 2023 - July 2023

I worked on predicting cases of Autism Spectrum Disorder in children within the age of 24 months, using Electronic Health Records (EHR) data, by using time-series models.
Using PySpark, I preprocessed a dataset over 500GB to gather data for over 600000 patients. This data was used for cross-validation for multiple models to determine the best fit and most important features contributing to the final binary decision

Siemens Technology and Services

Research and Digitization Automation Intern

January 2022 - July 2022

I was responsible for detecting anomalies in data using AutoEnCoder models and using explainable AI (SHAP) to identify which features contribute to the anomalies. I developed a plug-and-play library for this having multiple layers of abstraction and obfuscation to protect intellectual property.
I experimented with deep learning models and modifying workflows to create a library which could be used with any time-series data from multiple sources (csv, sql, influxdb). Demonstrated Proof-of-concpt using DataBricks on Siemens Buildings data to an internal team.
I performed exploratory data analysis on coffee roaster data from Starbucks to identify burner cuts and its relation with other variables. I trained time-series models to predict burner cuts and developed a real-time prediction library to demonstrate its efficacy.

Siemens Technology and Services

Research and Digitization Automation Intern

May 2021 - July 2021

I worked on the Industrial Predictive Analytics Engine (IPAE), where I integrated workflow creation, management and progress monitoring using celery, redis and flask, which aided to running multiple workflows in parallel and reducing overall execution time of the pipeline by 30%.
I was responsible for maintaining and upgrading a library for realizing model training workflows for time-series datasets. This was used in the Dubai Expo 2021.

The Pennsylvania State University

Teaching Assistant

I was a Teaching Assistant for the course CMPEN 270: Digital Design: Theory and Practice, CMPSC221: Object Oriented Design & Web Programming and CMPSC448: Machine Learning and Algorithmic AI