<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Observability | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/observability/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/observability/index.xml" rel="self" type="application/rss+xml"/><description>Observability</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Thu, 25 Sep 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>Observability</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/observability/</link></image><item><title>Final Update: Building Intelligent Observability for NRP</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250925-manish-reddy/</link><pubDate>Thu, 25 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250925-manish-reddy/</guid><description>&lt;p>I&amp;rsquo;m excited to share the completion of my OSRE 2025 project, &amp;ldquo;&lt;em>Intelligent Observability for NRP: A GenAI Approach&lt;/em>&amp;rdquo; and the significant learning journey it has been. We&amp;rsquo;ve successfully developed a novel InfoAgent architecture that delivers on our core goal: building an ML-powered service for NRP that analyzes monitoring data, detects anomalies, and provides trustworthy GenAI explanations.&lt;/p>
&lt;h2 id="how-our-novel-infoagent-architecture-advances-the-observability-mission">How Our Novel InfoAgent Architecture Advances the Observability Mission&lt;/h2>
&lt;p>Through extensive development and testing, I&amp;rsquo;ve learned a great deal about building production-ready AI systems and have implemented a novel InfoAgent architecture that orchestrates our specialized agents:&lt;/p>
&lt;h3 id="1-prometheus-metrics-analysis-agent">1. Prometheus Metrics Analysis Agent&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Continuously ingests and processes NRP&amp;rsquo;s Prometheus metrics&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: Fully implemented data pipelines handling multiple metric types with optimized latency&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Provides the foundation for anomaly detection by establishing normal behavior baselines&lt;/li>
&lt;/ul>
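&lt;p>As a rough illustration of what establishing a baseline can look like, the sketch below parses a response in the JSON shape returned by the Prometheus HTTP API range-query endpoint and computes a mean and standard deviation per series. This is not the project code: the function name and the choice of mean plus population standard deviation are assumptions made for the example.&lt;/p>

```python
import json
import statistics


def baselines_from_range_response(payload):
    """Map each series label set to (mean, stdev) of its samples.

    `payload` is the JSON text of a Prometheus range-query response.
    The function name and the statistics chosen are illustrative assumptions.
    """
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    baselines = {}
    for series in body["data"]["result"]:
        # Each sample is a pair: [unix_timestamp, "value-as-string"].
        values = [float(v) for _, v in series["values"]]
        # Use the sorted label pairs as a hashable key for the series.
        key = tuple(sorted(series["metric"].items()))
        spread = statistics.pstdev(values) if len(values) > 1 else 0.0
        baselines[key] = (statistics.fmean(values), spread)
    return baselines
```

&lt;p>A detector could then compare each new sample against the stored mean and spread for its series; real baselines would likely also need to account for daily or weekly seasonality.&lt;/p>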
&lt;h3 id="2-query-refinement-agent-croq">2. Query Refinement Agent (CROQ)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Clarifies ambiguous metrics or patterns before generating explanations&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: Completed implementation of Conformal Revision of Questions for disambiguation&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Ensures explanations address the right system behaviors (e.g., distinguishing CPU saturation from memory pressure)&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: Successfully improved accuracy of GenAI explanations by eliminating misinterpretations&lt;/li>
&lt;/ul>
&lt;h3 id="3-explanation-generation-agent-ais">3. Explanation Generation Agent (AIS)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Creates human-readable explanations and root-cause analysis&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: Finalized the Automated Information Seeker with a complete Plan→Validate→Execute→Assess→Revise cycle&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Transforms technical anomalies into actionable insights for operators&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: Delivers GenAI explanations with uncertainty quantification&lt;/li>
&lt;/ul>
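&lt;p>The Plan→Validate→Execute→Assess→Revise cycle can be pictured as a bounded loop. The sketch below is a minimal, hypothetical skeleton rather than the AIS implementation: the five stage functions are passed in as plain callables, and the names &lt;code>run_cycle&lt;/code>, &lt;code>min_confidence&lt;/code> and &lt;code>max_rounds&lt;/code> are invented for the example.&lt;/p>

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CycleResult:
    answer: str
    confidence: float
    rounds: int
    history: List[str] = field(default_factory=list)


def run_cycle(question, plan, validate, execute, assess, revise,
              min_confidence=0.8, max_rounds=3):
    """Run the five stages until the answer is confident enough or rounds run out.

    All stage callables and both thresholds are hypothetical placeholders.
    """
    current_plan = plan(question)
    history = []
    answer, confidence = "", 0.0
    for round_no in range(1, max_rounds + 1):
        # Validate: reject a malformed plan before spending work on it.
        if not validate(current_plan):
            history.append(f"round {round_no}: plan rejected")
            current_plan = revise(current_plan, "plan failed validation")
            continue
        # Execute the plan, then assess how trustworthy the answer is.
        answer = execute(current_plan)
        confidence = assess(answer)
        history.append(f"round {round_no}: confidence {confidence:.2f}")
        if confidence >= min_confidence:
            return CycleResult(answer, confidence, round_no, history)
        # Revise: feed the weak answer back into the next plan.
        current_plan = revise(current_plan, answer)
    return CycleResult(answer, confidence, max_rounds, history)
```

&lt;p>Capping the number of rounds keeps the loop from running indefinitely when no plan reaches the confidence target; the last answer is returned together with its confidence so the caller can decide how to present it.&lt;/p>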
&lt;h2 id="completed-integration-the-novel-infoagent-pipeline">Completed Integration: The Novel InfoAgent Pipeline&lt;/h2>
&lt;p>We&amp;rsquo;ve successfully integrated all agents into a unified observability pipeline that represents our novel contribution:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Collection&lt;/strong>: Prometheus metrics → Analysis Agent (comprehensive metrics support)&lt;/li>
&lt;li>&lt;strong>Anomaly Detection&lt;/strong>: With statistical confidence bounds using conformal prediction&lt;/li>
&lt;li>&lt;strong>Query Refinement&lt;/strong>: Resolving ambiguities before explanation&lt;/li>
&lt;li>&lt;strong>Explanation Generation&lt;/strong>: Human-readable analysis with uncertainty awareness&lt;/li>
&lt;li>&lt;strong>Feedback Loop&lt;/strong>: System learning from operator interactions (implemented and tested)&lt;/li>
&lt;/ol>
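&lt;p>To make the anomaly detection step more concrete, here is a small sketch of split conformal prediction applied to a single metric. It is an illustration under stated assumptions, not the code used in the project: the nonconformity score (absolute deviation from the calibration median) and the function names are chosen for the example, and the coverage guarantee relies on calibration and new readings being exchangeable.&lt;/p>

```python
import math
import statistics


def conformal_threshold(calibration, alpha=0.1):
    """Return (center, threshold) from a window of readings assumed to be normal.

    Under exchangeability, a new normal reading exceeds the threshold
    with probability at most alpha.
    """
    center = statistics.median(calibration)
    # Nonconformity score: distance from the calibration median (an assumption).
    scores = sorted(abs(x - center) for x in calibration)
    n = len(scores)
    # Finite-sample corrected rank used by split conformal prediction.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to support this alpha: flag nothing.
        return center, math.inf
    return center, scores[rank - 1]


def is_anomalous(value, center, threshold):
    """Flag a reading whose score falls outside the conformal bound."""
    return abs(value - center) > threshold
```

&lt;p>With 20 calibration readings and &lt;code>alpha=0.1&lt;/code>, the threshold is the 19th smallest score, so only readings further from the median than nearly all calibration points are flagged.&lt;/p>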
&lt;h2 id="hardware-testing-results">Hardware Testing Results&lt;/h2>
&lt;p>This project taught me valuable lessons about optimizing AI workloads on specialized hardware. We successfully tested our observability framework on Qualcomm Cloud AI 100 Ultra hardware:&lt;/p>
&lt;ul>
&lt;li>Achieved significant performance improvements over baseline CPU implementation&lt;/li>
&lt;li>Successfully ported and optimized GLM-4.5 for observability-specific tasks&lt;/li>
&lt;li>Validated that specialized AI hardware significantly enhances real-time anomaly detection&lt;/li>
&lt;/ul>
&lt;h2 id="learning-journey-and-novel-contributions">Learning Journey and Novel Contributions&lt;/h2>
&lt;p>Throughout OSRE 2025, I&amp;rsquo;ve learned extensively about:&lt;/p>
&lt;ol>
&lt;li>Building hierarchical agent coordination systems for complex reasoning&lt;/li>
&lt;li>Implementing conformal prediction for trustworthy AI outputs&lt;/li>
&lt;li>Creating self-correcting explanation pipelines&lt;/li>
&lt;li>Developing adaptive learning systems from operator feedback&lt;/li>
&lt;/ol>
&lt;p>The novel InfoAgent architecture shows promising results in our testing environment; the evaluation metrics and benchmarks are still being refined.&lt;/p>
&lt;h2 id="ongoing-work-continuing-beyond-osre">Ongoing Work: Continuing Beyond OSRE&lt;/h2>
&lt;p>While OSRE 2025 is concluding, I&amp;rsquo;m actively continuing to contribute to this project:&lt;/p>
&lt;ol>
&lt;li>Preparing the InfoAgent framework for open-source release with comprehensive documentation&lt;/li>
&lt;li>Running extended evaluation tests on the Nautilus platform (work in progress)&lt;/li>
&lt;li>Writing a research paper detailing our novel architecture&lt;/li>
&lt;li>Creating tutorials to help others implement intelligent observability&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Project Updates and Code&lt;/strong>: You can follow my ongoing contributions and access the latest code at &lt;a href="https://mreddy10.pages.nrp-nautilus.io/gsocnrp/" target="_blank" rel="noopener">https://mreddy10.pages.nrp-nautilus.io/gsocnrp/&lt;/a>&lt;/p>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>I&amp;rsquo;m deeply grateful to my lead mentor &lt;strong>Mohammad Firas Sada&lt;/strong> for his exceptional guidance throughout this transformative learning experience. His insights have been invaluable in helping me develop the novel InfoAgent architecture and navigate the complexities of building production-ready AI systems.&lt;/p>
&lt;p>The OSRE 2025 program has been an incredible journey of growth and discovery. I&amp;rsquo;ve learned not just how to build AI systems, but how to make them trustworthy, explainable, and genuinely useful for real-world operations. The novel InfoAgent architecture we&amp;rsquo;ve developed serves the original mission: creating an intelligent observability tool that helps NRP operators solve problems faster and keep complex research systems running smoothly.&lt;/p>
&lt;p>I&amp;rsquo;m excited to continue contributing to this project and look forward to seeing how the community adopts and extends these ideas. Check out my contributions and ongoing updates at &lt;a href="https://mreddy10.pages.nrp-nautilus.io/gsocnrp/" target="_blank" rel="noopener">https://mreddy10.pages.nrp-nautilus.io/gsocnrp/&lt;/a>!&lt;/p></description></item><item><title>Midterm Update: Building Intelligent Observability for NRP</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250801-manish-reddy/</link><pubDate>Fri, 01 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250801-manish-reddy/</guid><description>&lt;p>I&amp;rsquo;m pleased to share the progress we&amp;rsquo;ve made on my OSRE 2025 project, &amp;ldquo;&lt;em>Intelligent Observability for Seam: A GenAI Approach&lt;/em>&amp;rdquo; since my initial announcement. We&amp;rsquo;re working toward our core goal: building an ML-powered service for NRP that analyzes monitoring data, detects anomalies, and provides trustworthy GenAI explanations.&lt;/p>
&lt;h2 id="how-our-agents-support-the-observability-mission">How Our Agents Support the Observability Mission&lt;/h2>
&lt;p>We&amp;rsquo;ve been developing specialized agents and tools that work together to support our original project vision:&lt;/p>
&lt;h3 id="1-prometheus-metrics-analysis-agent">1. Prometheus Metrics Analysis Agent&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Continuously ingests and processes NRP&amp;rsquo;s Prometheus metrics&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: We&amp;rsquo;ve implemented initial data pipelines for key system metrics&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Provides the foundation for anomaly detection by establishing normal behavior baselines&lt;/li>
&lt;/ul>
&lt;h3 id="2-query-refinement-agent-croq">2. Query Refinement Agent (CROQ)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Clarifies ambiguous metrics or patterns before generating explanations&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: We&amp;rsquo;ve implemented a basic version of Conformal Revision of Questions to resolve metric ambiguities&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Aims to ensure explanations address the right system behaviors (e.g., distinguishing CPU saturation from memory pressure)&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: We hope this will improve the accuracy of GenAI explanations by eliminating misinterpretations&lt;/li>
&lt;/ul>
&lt;h3 id="3-explanation-generation-agent-ais">3. Explanation Generation Agent (AIS)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Creates human-readable explanations and root-cause analysis&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: We&amp;rsquo;ve built a prototype of the Automated Information Seeker with a Plan→Validate→Execute→Assess→Revise cycle&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Transforms technical anomalies into actionable insights for operators&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: Intended to directly deliver on the GenAI explanation component of our tool&lt;/li>
&lt;/ul>
&lt;h2 id="integration-progress">Integration Progress&lt;/h2>
&lt;p>We&amp;rsquo;re working to connect our agents into a unified observability pipeline:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Collection&lt;/strong>: Prometheus metrics → Analysis Agent&lt;/li>
&lt;li>&lt;strong>Anomaly Detection&lt;/strong>: With statistical confidence bounds (in development)&lt;/li>
&lt;li>&lt;strong>Query Refinement&lt;/strong>: Resolving ambiguities before explanation&lt;/li>
&lt;li>&lt;strong>Explanation Generation&lt;/strong>: Human-readable analysis with uncertainty awareness&lt;/li>
&lt;li>&lt;strong>Feedback Loop&lt;/strong>: System learning from operator interactions (planned)&lt;/li>
&lt;/ol>
&lt;h2 id="hardware-testing-opportunity">Hardware Testing Opportunity&lt;/h2>
&lt;p>This project has given us a valuable opportunity to test our observability framework on Qualcomm Cloud AI 100 Ultra hardware. We&amp;rsquo;re beginning to port different LLM architectures specifically for:&lt;/p>
&lt;ul>
&lt;li>Exploring anomaly detection performance on specialized AI hardware&lt;/li>
&lt;li>Testing explanation generation quality across different model architectures&lt;/li>
&lt;li>Comparing GLM-4.5 against other models for observability-specific tasks&lt;/li>
&lt;/ul>
&lt;h2 id="next-phase-completing-the-observability-tool">Next Phase: Completing the Observability Tool&lt;/h2>
&lt;p>For the remainder of OSRE 2025, we&amp;rsquo;re focused on:&lt;/p>
&lt;ol>
&lt;li>Finalizing integration of all agents into a cohesive anomaly detection tool that integrates with Matrix&lt;/li>
&lt;li>Validating that our GenAI explanations help operators resolve user issues faster, which we plan to test on the Nautilus Matrix platform&lt;/li>
&lt;li>Optimizing performance on specialized hardware for NRP&amp;rsquo;s scale&lt;/li>
&lt;li>Preparing the open-source release of our intelligent observability tool&lt;/li>
&lt;/ol>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>I&amp;rsquo;m deeply grateful to my lead mentor &lt;strong>Mohammad Firas Sada&lt;/strong> for his guidance in keeping our work focused on NRP&amp;rsquo;s observability needs. His insights have been invaluable in navigating the challenges of this project.&lt;/p>
&lt;p>While we&amp;rsquo;ve developed several agents and frameworks, everything we&amp;rsquo;re building serves the original mission: creating an intelligent observability tool that helps NRP operators solve problems faster and keep complex research systems running smoothly.&lt;/p>
&lt;p>I look forward to sharing more progress on our observability tool with GenAI explanations in the coming weeks!&lt;/p></description></item></channel></rss>