Sam Huang | UCSC OSPO

Final Report for Smart Environments

Wed, 05 Nov 2025 00:00:00 +0000

Introduction

The process of creating the necessary software environment for code to run is a significant challenge in software development. Given a piece of open-source software intended for research, setting up the environmental dependencies to run the software could take significant manual effort. Existing automation methods struggle due to the complexity of managing diverse languages, dependencies, and hardware. In Smart Environments, I have created ENVAGENT, a general multi-agent framework designed to automate the construction of executable environments for reproducing research prototypes from top-tier conferences and journals. While reproducibility has become a growing concern in the research community, the process of setting up environments remains time-consuming, error-prone, and often poorly documented.

To assess this capability, a new benchmark, ENVBENCH, was created, containing 54 popular projects across seven languages. Results show ENVAGENT dramatically improves environment construction compared to current agents (+16.2%). Furthermore, the system shows initial promise in dynamically adjusting cloud-based hardware resources based on the code’s needs.

Method

EnvAgent

The EnvAgent I created during my time at OSRE utilizes a multi-agent workflow to automatically build software execution environments. The process is structured into three phases: preparation, construction, and refinement.

Phase 1 (Preparation): Specialized agents collect information about the software repository – its structure, relevant files, and the host system’s hardware specifications (CPU, memory, etc.). This data is then used by a planning agent to generate a detailed, step-by-step instruction set for creating a functional Dockerfile.

Phase 2 (Construction): Two agents work in tandem: one generates or modifies the Dockerfile based on the plan, while the other executes the Dockerfile within an isolated container, capturing any errors.

Phase 3 (Refinement): A final agent analyzes the container execution data, identifying areas for improvement in the Dockerfile. This process repeats until a stable, executable environment is achieved.

To improve efficiency, EnvAgent incorporates rule-based tools for predictable tasks like directory setup and log management, reducing the need for complex agent reasoning. This combination of intelligent agents and automated routines (“scaffolding”) ensures a robust and adaptive system.
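As a rough illustration of what such rule-based scaffolding might look like, here is a minimal Python sketch. The function names (`prepare_workspace`, `trim_log`), the directory names, and the 50-line default are assumptions made for this example, not EnvAgent's actual code. The point is that these steps are deterministic and need no LLM call.

```python
import tempfile
from pathlib import Path


def prepare_workspace(root: Path, subdirs=("logs", "dockerfiles", "history")) -> dict:
    """Create the working directories a run needs (safe to call repeatedly)."""
    paths = {}
    for name in subdirs:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths


def trim_log(text: str, max_lines: int = 50) -> str:
    """Keep only the last max_lines lines of a build log.

    Build errors usually show up near the end of the log, so the tail is
    what a refinement agent most needs to read.
    """
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    omitted = len(lines) - max_lines
    header = f"[... {omitted} earlier lines omitted ...]"
    return "\n".join([header] + lines[-max_lines:])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(sorted(prepare_workspace(Path(tmp))))
    print(trim_log("\n".join(f"line {i}" for i in range(100)), max_lines=3))
```

Handling these tasks in plain code keeps the agents' context small and makes the behavior easy to test.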

EnvEval Benchmark

In addition to the agent, one significant contribution is the manual curation of a benchmark that measures the quality of generated environments. EnvEval is designed to assess environment setup quality across 54 carefully curated open-source repositories, chosen from both Chameleon reproducible artifacts and the Multi-SWE-bench dataset. EnvEval contains JSON rubrics that can be used to automatically determine the quality of constructed environments.

Each rubric is divided into three parts, corresponding to three major objectives that a successfully constructed environment should have:

Structure: Checks for basic directory structure, file presence, and environment variables.
Configuration: Asks “Is this configured?” and checks whether dependencies have been correctly configured.
Functionality: Asks “Is this usable?” and runs actual tests to see whether the expected functionality is present.

There are many tests in each category, and their weights are adjusted based on their importance.
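To make the weighting idea concrete, here is a small Python sketch of how a rubric with the three categories above could be scored. The JSON layout, the check names, and the weights are hypothetical; the real EnvEval schema is not reproduced in this post.

```python
import json

# Hypothetical rubric layout, for illustration only.
RUBRIC_JSON = """
{
  "structure": [
    {"name": "README present", "weight": 1, "passed": true},
    {"name": "PATH variable set", "weight": 1, "passed": true}
  ],
  "configuration": [
    {"name": "dependencies installed", "weight": 3, "passed": true},
    {"name": "compiler version pinned", "weight": 1, "passed": false}
  ],
  "functionality": [
    {"name": "unit tests run", "weight": 4, "passed": false}
  ]
}
"""


def score_rubric(rubric: dict) -> dict:
    """Return the weighted pass percentage per category and overall."""
    scores = {}
    total_weight = 0
    total_earned = 0
    for category, checks in rubric.items():
        weight = sum(check["weight"] for check in checks)
        earned = sum(check["weight"] for check in checks if check["passed"])
        scores[category] = 100.0 * earned / weight if weight else 0.0
        total_weight += weight
        total_earned += earned
    scores["overall"] = 100.0 * total_earned / total_weight if total_weight else 0.0
    return scores


if __name__ == "__main__":
    print(score_rubric(json.loads(RUBRIC_JSON)))
```

In this toy rubric the single functionality check carries the most weight, so failing it pulls the overall score down more than failing a structure check would.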

Evaluation

Baseline Systems:

The study compared EnvAgent to two established automated code generation systems: one utilizing Anthropic’s advanced reasoning models and the other employing OpenAI’s code-focused models. These systems were chosen for their strong performance in creating software code and their prevalence in automated engineering processes. Both baselines were given full access to the target software repositories and complete details about the host system’s hardware.

Evaluation Metrics:

The performance of EnvAgent was assessed using three key metrics: the ability to create working environments, the quality of those environments, and a single combined score. EnvAgent significantly outperformed the baselines, with a reported 33.91% improvement in the final overall score: it reached 74.01, compared with the best baseline score of 30.10. This suggests that EnvAgent both produced more functional environments and ensured greater accuracy through extensive testing.

Conclusion

The process of creating the necessary software environments for code agents is a major hurdle in scaling up research and development. Currently, this task relies heavily on manual labor. To address this, a new system, ENVAGENT, was created to automatically build these environments using intelligent agents and by understanding dependencies. A new benchmark, ENVBENCH, was also developed to assess this system’s effectiveness. Preliminary results demonstrate a significant improvement: ENVAGENT achieved a 33.91% improvement in the final overall score compared to existing automated agents, representing a substantial step towards more efficient and reproducible research.

Thank you!


Midterm for Smart Environments

Thu, 24 Jul 2025 00:00:00 +0000

What is EnvGym?

EnvGym is a general multi-agent framework designed to automate the construction of executable environments for reproducing research prototypes from top-tier conferences and journals. While reproducibility has become a growing concern in the research community, the process of setting up environments remains time-consuming, error-prone, and often poorly documented.

EnvGym addresses this gap by leveraging LLM-powered agents to analyze project instructions, resolve dependencies, configure execution environments, and validate results—thereby reducing human overhead and improving reproducibility at scale.

Progress

New Tools

Initially, our agent had access to only one tool: the command line. This constrained the agent’s ability to decompose complex tasks and respond flexibly to failures. Over the last few weeks, we introduced a modular tool system, enabling the agent to handle specific subtasks more effectively.

The new toolset includes:

dockerrun: Executes Dockerfiles.
hardware_checking, hardware_adjustment: Tailor builds to available resources.
history_manager, stats: Tracks historical data for improvement and reproducibility.
planning: Generates high-level execution plans.
summarize: Interprets build results to adjust subsequent iterations.
writing_docker_initial, writing_docker_revision: Generate and refine Dockerfiles.

While some of these tools, such as dockerrun, run programmatic scripts, others, such as planning, are more complex and use LLMs themselves.
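One way to picture a modular tool system like this is a small registry that maps tool names to callables and records whether a tool relies on an LLM. The sketch below is only an illustration: the `Tool` class, `REGISTRY`, `call_tool`, and the `fake_llm` stand-in are assumptions for this example, and the tool bodies are placeholders rather than EnvGym's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    name: str
    run: Callable[[str], str]
    uses_llm: bool  # True when the tool delegates to a language model


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"[model response to: {prompt}]"


def dockerrun(dockerfile: str) -> str:
    # Placeholder: a real tool would build and run the image here.
    return "build skipped (sketch)"


def planning(repo_summary: str) -> str:
    # LLM-backed tool: wraps its input in a task-specific prompt.
    return fake_llm(f"Write a step-by-step environment plan for: {repo_summary}")


REGISTRY: Dict[str, Tool] = {
    tool.name: tool
    for tool in (
        Tool("dockerrun", dockerrun, uses_llm=False),
        Tool("planning", planning, uses_llm=True),
    )
}


def call_tool(name: str, argument: str) -> str:
    """Dispatch by tool name, failing loudly on unknown tools."""
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return REGISTRY[name].run(argument)


if __name__ == "__main__":
    print(call_tool("dockerrun", "FROM python:3.12"))
    print(call_tool("planning", "a Python package with a requirements.txt"))
```

Keeping the `uses_llm` flag on each tool makes it easy to count model calls or to swap in a stub when testing the programmatic tools on their own.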

Agent Re-Architecture: Moving Beyond Codex

We transitioned away from OpenAI’s Codex agent implementation. While powerful, Codex’s framework was overly reliant on its CLI frontend, which added unnecessary complexity and limited customizability for our research context.

We implemented our own lightweight, customizable agent pipeline that integrates LLM-based planning with iterative execution. Conceptually, the agent executes the following loop:

Repo Scanning
Hardware Check
Planning & Initial Dockerfile Generation
Docker Execution
Progress Summarization & Adjustment
Iterative Dockerfile Refinement (up to 20 rounds)
Success Check & Logging

This new agent design is easier to control, extend, and debug—aligning better with the needs of reproducibility research.
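The loop above can be sketched in a few lines of Python. This is a simplified illustration, not our actual pipeline: `run_agent`, `BuildResult`, and the four callables are hypothetical names, and the stub functions in the demo stand in for the real LLM-backed and Docker-backed tools. Only the cap of 20 refinement rounds comes from the list above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_ROUNDS = 20  # refinement cap from the loop description


@dataclass
class BuildResult:
    success: bool
    log: str


def run_agent(
    plan: str,
    write_initial: Callable[[str], str],
    build: Callable[[str], BuildResult],
    summarize: Callable[[BuildResult], str],
    revise: Callable[[str, str], str],
    max_rounds: int = MAX_ROUNDS,
) -> Tuple[bool, str, List[str]]:
    """Build once, then summarize and revise until success or the cap."""
    dockerfile = write_initial(plan)
    result = build(dockerfile)
    history: List[str] = []
    rounds = 0
    while not result.success and rounds < max_rounds:
        feedback = summarize(result)
        history.append(feedback)
        dockerfile = revise(dockerfile, feedback)
        result = build(dockerfile)
        rounds += 1
    return result.success, dockerfile, history


if __name__ == "__main__":
    # Stub tools: the "build" succeeds once git is installed.
    def initial(plan: str) -> str:
        return "FROM ubuntu:22.04\n"

    def fake_build(dockerfile: str) -> BuildResult:
        ok = "git" in dockerfile
        return BuildResult(ok, "ok" if ok else "git: not found")

    def add_git(dockerfile: str, feedback: str) -> str:
        return dockerfile + "RUN apt-get update && apt-get install -y git\n"

    print(run_agent("install git", initial, fake_build, lambda r: r.log, add_git))
```

Passing the tools in as callables keeps the control flow separate from the tools themselves, which is what makes a loop like this easy to debug with stubs.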

Prompt Engineering

For each tool that requires LLMs to function, we created a set of custom prompts that outline the task and break down the goals. For instance, the prompt used in summarize differs from the one in planning, allowing us to optimize the behavior of LLM agents per context.
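As a sketch of what per-tool prompts can look like in code, the example below keeps one template per tool and fills in only the fields that tool needs. The prompt texts, the field names, and `render_prompt` are made up for illustration; our actual prompts are longer and are not reproduced here.

```python
from string import Template

# Hypothetical prompt texts, one per LLM-backed tool.
PROMPTS = {
    "planning": Template(
        "You are planning an environment build.\n"
        "Repository summary:\n$repo_summary\n"
        "Hardware:\n$hardware\n"
        "List the steps needed to write a working Dockerfile."
    ),
    "summarize": Template(
        "You are reviewing a Docker build.\n"
        "Build log tail:\n$log_tail\n"
        "State what failed and what to change next."
    ),
}


def render_prompt(tool: str, **fields: str) -> str:
    """Fill the template for one tool; a missing field raises KeyError."""
    return PROMPTS[tool].substitute(**fields)


if __name__ == "__main__":
    print(render_prompt("summarize", log_tail="pip: command not found"))
```

Using `substitute` rather than `safe_substitute` means a forgotten field fails immediately instead of sending a half-filled prompt to the model.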

Performance Gains

With these improvements, EnvGym now successfully replicates 9 repositories, surpassing our baseline Codex agent which struggled with the same set. We’ve observed more reliable planning, better handling of edge-case dependencies, and faster convergence in iterative Dockerfile revisions.

Next Steps

Granular Evaluation Metric

We plan to adopt a tree-structured rubric-based evaluation, inspired by PaperBench. Instead of binary success/failure, each repo will be assigned a reproducibility score from 0–100.

Key tasks include:

Rubric Design: Define a hierarchical rubric with criteria like dependency resolution, test success rate, runtime match, etc.
Manual Annotation: Build a dataset of ground-truth rubrics for a subset of repos to calibrate our automatic judge.
Judge Implementation: Develop an LLM-based judge function that takes (i) rubric and (ii) environment state, and returns a reproducibility score.

Source: Starace, Giulio, et al. “PaperBench: Evaluating AI’s Ability to Replicate AI Research.” arXiv preprint arXiv:2504.01848 (2025).

This will make EnvGym suitable for benchmarking. We will run our new method and obtain a score to compare with baseline methods!
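To show how a tree-structured rubric could turn into a single 0–100 score, here is a small Python sketch. It assumes that leaf scores come from a judge and that each inner node takes the weight-averaged score of its children; that aggregation rule, the `RubricNode` class, and the example criteria and weights are assumptions for illustration, not a finished design.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    name: str
    weight: float = 1.0
    score: Optional[float] = None  # leaf score in [0, 100], set by a judge
    children: List["RubricNode"] = field(default_factory=list)


def aggregate(node: RubricNode) -> float:
    """Leaves report their own score; inner nodes average their children by weight."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf '{node.name}' has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"node '{node.name}' has no positive child weight")
    weighted = sum(child.weight * aggregate(child) for child in node.children)
    return weighted / total_weight


if __name__ == "__main__":
    rubric = RubricNode(
        "reproducibility",
        children=[
            RubricNode("dependency resolution", weight=2, score=100),
            RubricNode(
                "test success rate",
                weight=2,
                children=[
                    RubricNode("unit tests", score=50),
                    RubricNode("integration tests", score=0),
                ],
            ),
            RubricNode("runtime match", weight=1, score=80),
        ],
    )
    print(aggregate(rubric))
```

With these example numbers the test subtree averages to 25, and the root works out to (2·100 + 2·25 + 1·80) / 5 = 66.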

Conclusion

EnvGym has made strong progress toward automating reproducibility in computational research. Through modularization, agentic design, and prompt optimizations, we’ve surpassed existing baselines and laid the groundwork for even more improvement.

The upcoming focus on metrics and benchmarking will elevate EnvGym from a functional prototype to a standardized reproducibility benchmark tool and also quantitatively prove that our new agentic method is better than existing tools such as Codex. Excited for what’s to come!


Smart Environments – An AI System for Reproducible Custom Computing Environments

Mon, 16 Jun 2025 00:00:00 +0000

Hi everyone, I’m Sam! I’m excited to be working with the Argonne National Laboratory and SoR this summer on Smart Environments. Have you ever encountered a great open-source project and wanted to run it or use it locally, only to find that it’s such a headache to set up all the dependencies? Maybe your system version wasn’t correct, or a piece of software was outdated, or the dependencies were incompatible with something you already had on your machine?

In comes EnvGym to save the day! We want EnvGym to be an agent that helps reproduce open-source projects by automatically setting up the environmental dependencies required to get them running. That’s what I will be working on for the rest of the summer! To make EnvGym work, we will be leveraging LLM agents to tackle the problem. We will use EnvGym to read documentation, understand code structures, run commands to set up environments, and reflectively react to any errors and warnings.

To build EnvGym, I have the following to-do’s in mind:

Building a dataset that includes repos to be reproduced
Establishing a baseline using current methods
Implementing the actual EnvGym algorithm
Testing EnvGym against baseline performance and iteratively improving it
Deploying EnvGym to real-world use cases and gathering feedback

Here is the repo that we are working on: https://github.com/EaminC/EnvGym/tree/main

More updates to come, thanks for reading!