<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Manish K Reddy | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/manish-k-reddy/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/manish-k-reddy/index.xml" rel="self" type="application/rss+xml"/><description>Manish K Reddy</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/manish-k-reddy/avatar_hu4b304da5b7fd4120ba9ee1d016ae6d82_1887962_270x270_fill_q75_lanczos_center.jpg</url><title>Manish K Reddy</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/manish-k-reddy/</link></image><item><title>Final Update: Building Intelligent Observability for NRP</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250925-manish-reddy/</link><pubDate>Thu, 25 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250925-manish-reddy/</guid><description>&lt;p>I&amp;rsquo;m excited to share the completion of my OSRE 2025 project, &amp;ldquo;&lt;em>Intelligent Observability for NRP: A GenAI Approach&lt;/em>&amp;rdquo; and the significant learning journey it has been. We&amp;rsquo;ve successfully developed a novel InfoAgent architecture that delivers on our core goal: building an ML-powered service for NRP that analyzes monitoring data, detects anomalies, and provides trustworthy GenAI explanations.&lt;/p>
&lt;h2 id="how-our-novel-infoagent-architecture-advances-the-observability-mission">How Our Novel InfoAgent Architecture Advances the Observability Mission&lt;/h2>
&lt;p>Through extensive development and testing, I&amp;rsquo;ve learned a great deal about building production-ready AI systems and have implemented a novel InfoAgent architecture that orchestrates our specialized agents:&lt;/p>
&lt;h3 id="1-prometheus-metrics-analysis-agent">1. Prometheus Metrics Analysis Agent&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Continuously ingests and processes NRP&amp;rsquo;s Prometheus metrics&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: Fully implemented data pipelines handling multiple metric types with optimized latency&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Provides the foundation for anomaly detection by establishing normal behavior baselines&lt;/li>
&lt;/ul>
&lt;h3 id="2-query-refinement-agent-croq">2. Query Refinement Agent (CROQ)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Clarifies ambiguous metrics or patterns before generating explanations&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: Completed implementation of Conformal Revision of Questions for disambiguation&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Ensures explanations address the right system behaviors (e.g., distinguishing CPU saturation from memory pressure)&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: Successfully improved the accuracy of GenAI explanations by eliminating misinterpretations&lt;/li>
&lt;/ul>
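&lt;p>As a rough illustration of the idea behind this agent, the sketch below builds a split conformal prediction set over candidate interpretations of an alert and keeps only the plausible ones for a revised question. It is a minimal sketch under stated assumptions, not the project code: the function names, the candidate labels, and all scores are hypothetical.&lt;/p>

```python
import math


def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores: 1 - (probability given to the true label) on held-out
    calibration examples. The values used below are made up.
    """
    n = len(cal_scores)
    # Finite-sample corrected rank, capped at n for simplicity
    # (the strict rule would return infinity when the rank exceeds n).
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, threshold):
    """Keep each candidate whose nonconformity score (1 - p) is within the threshold."""
    return [label for label, p in probs.items() if threshold >= 1 - p]


def revise_question(question, probs, threshold):
    """Rewrite the question so it offers only the surviving candidates."""
    kept = prediction_set(probs, threshold)
    return f"{question} Options: {', '.join(kept)}"


if __name__ == "__main__":
    # Hypothetical calibration scores and model probabilities.
    cal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.9]
    probs = {"cpu_saturation": 0.50, "memory_pressure": 0.35, "network_congestion": 0.15}
    q_hat = conformal_threshold(cal, alpha=0.2)
    print(revise_question("Which behavior explains the alert?", probs, q_hat))
```

&lt;p>With these made-up numbers the unlikely candidate is dropped, so the follow-up question only has to separate CPU saturation from memory pressure.&lt;/p>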
&lt;h3 id="3-explanation-generation-agent-ais">3. Explanation Generation Agent (AIS)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Creates human-readable explanations and root-cause analysis&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: Finalized the Automated Information Seeker with a complete Plan→Validate→Execute→Assess→Revise cycle&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Transforms technical anomalies into actionable insights for operators&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: Delivers GenAI explanations with uncertainty quantification&lt;/li>
&lt;/ul>
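&lt;p>The Plan→Validate→Execute→Assess→Revise cycle can be pictured as a small control loop. The sketch below only illustrates the ordering of the stages: every callable, the task dictionary, and the round limit are hypothetical placeholders, not the Automated Information Seeker itself.&lt;/p>

```python
def run_cycle(plan, validate, execute, assess, revise, task, max_rounds=3):
    """Plan -> Validate -> Execute -> Assess -> Revise control loop.

    All callables are hypothetical and supplied by the caller.
    Returns the last result and the round in which the loop stopped.
    """
    current = plan(task)
    result = None
    for round_no in range(1, max_rounds + 1):
        if validate(current):
            result = execute(current)
            if assess(result):
                return result, round_no
        # The plan was rejected or the result was judged insufficient.
        current = revise(current, result)
    return result, max_rounds


if __name__ == "__main__":
    # Toy stages: a plan is valid once its lookback window reaches 15 minutes.
    outcome = run_cycle(
        plan=lambda task: {"metric": task, "window": 5},
        validate=lambda p: p["window"] >= 15,
        execute=lambda p: f"checked {p['metric']} over {p['window']}m",
        assess=lambda r: r is not None,
        revise=lambda p, r: {**p, "window": p["window"] * 3},
        task="cpu_usage",
    )
    print(outcome)
```

&lt;p>Capping the number of rounds is one simple way to keep a self-correcting loop from running indefinitely; the limit of three here is arbitrary.&lt;/p>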
&lt;h2 id="completed-integration-the-novel-infoagent-pipeline">Completed Integration: The Novel InfoAgent Pipeline&lt;/h2>
&lt;p>We&amp;rsquo;ve successfully integrated all agents into a unified observability pipeline that represents our novel contribution:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Collection&lt;/strong>: Prometheus metrics → Analysis Agent (comprehensive metrics support)&lt;/li>
&lt;li>&lt;strong>Anomaly Detection&lt;/strong>: With statistical confidence bounds using conformal prediction&lt;/li>
&lt;li>&lt;strong>Query Refinement&lt;/strong>: Resolving ambiguities before explanation&lt;/li>
&lt;li>&lt;strong>Explanation Generation&lt;/strong>: Human-readable analysis with uncertainty awareness&lt;/li>
&lt;li>&lt;strong>Feedback Loop&lt;/strong>: System learning from operator interactions (implemented and tested)&lt;/li>
&lt;/ol>
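&lt;p>To make the anomaly detection step more concrete, here is a minimal sketch of a conformal p-value test on a single metric. It assumes a window of recent values stands in for normal behavior and uses distance from the median as the nonconformity score. The function names, the sample values, and the choice of score are assumptions for illustration, not the pipeline&amp;rsquo;s actual detector.&lt;/p>

```python
from statistics import median


def conformal_p_value(baseline, new_value):
    """Conformal p-value of new_value against a baseline window.

    Nonconformity score: absolute distance from the baseline median.
    For brevity the median comes from the same window as the calibration
    scores; a stricter split would use separate data for each.
    """
    center = median(baseline)
    cal_scores = [abs(x - center) for x in baseline]
    new_score = abs(new_value - center)
    # Count calibration scores at least as extreme as the new one.
    at_least = sum(1 for s in cal_scores if s >= new_score)
    return (at_least + 1) / (len(cal_scores) + 1)


def is_anomaly(baseline, new_value, alpha=0.1):
    """Flag the point when its p-value does not exceed the error rate alpha."""
    return alpha >= conformal_p_value(baseline, new_value)


if __name__ == "__main__":
    # Hypothetical CPU utilisation samples (percent).
    cpu = [41, 43, 40, 42, 44, 39, 42, 41, 43, 40,
           42, 41, 44, 40, 43, 42, 41, 39, 43]
    print(conformal_p_value(cpu, 95), is_anomaly(cpu, 95))
    print(conformal_p_value(cpu, 42), is_anomaly(cpu, 42))
```

&lt;p>The parameter alpha plays the role of the confidence bound: under the usual exchangeability assumption, roughly at most an alpha fraction of normal points would be flagged.&lt;/p>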
&lt;h2 id="hardware-testing-results">Hardware Testing Results&lt;/h2>
&lt;p>This project taught me valuable lessons about optimizing AI workloads on specialized hardware. We successfully tested our observability framework on Qualcomm Cloud AI 100 Ultra hardware:&lt;/p>
&lt;ul>
&lt;li>Achieved significant performance improvements over baseline CPU implementation&lt;/li>
&lt;li>Successfully ported and optimized GLM-4.5 for observability-specific tasks&lt;/li>
&lt;li>Validated that specialized AI hardware significantly enhances real-time anomaly detection&lt;/li>
&lt;/ul>
&lt;h2 id="learning-journey-and-novel-contributions">Learning Journey and Novel Contributions&lt;/h2>
&lt;p>Throughout OSRE 2025, I&amp;rsquo;ve learned extensively about:&lt;/p>
&lt;ol>
&lt;li>Building hierarchical agent coordination systems for complex reasoning&lt;/li>
&lt;li>Implementing conformal prediction for trustworthy AI outputs&lt;/li>
&lt;li>Creating self-correcting explanation pipelines&lt;/li>
&lt;li>Developing adaptive learning systems from operator feedback&lt;/li>
&lt;/ol>
&lt;p>The novel InfoAgent architecture demonstrates promising results in our testing environment; the evaluation metrics and benchmarks are still being refined.&lt;/p>
&lt;h2 id="ongoing-work-continuing-beyond-osre">Ongoing Work: Continuing Beyond OSRE&lt;/h2>
&lt;p>While OSRE 2025 is concluding, I&amp;rsquo;m actively continuing to contribute to this project:&lt;/p>
&lt;ol>
&lt;li>Preparing the InfoAgent framework for open-source release with comprehensive documentation&lt;/li>
&lt;li>Running extended evaluation tests on the Nautilus platform (work in progress)&lt;/li>
&lt;li>Writing a research paper detailing our novel architecture&lt;/li>
&lt;li>Creating tutorials to help others implement intelligent observability&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Project Updates and Code&lt;/strong>: You can follow my ongoing contributions and access the latest code at &lt;a href="https://mreddy10.pages.nrp-nautilus.io/gsocnrp/" target="_blank" rel="noopener">https://mreddy10.pages.nrp-nautilus.io/gsocnrp/&lt;/a>&lt;/p>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>I&amp;rsquo;m deeply grateful to my lead mentor &lt;strong>Mohammad Firas Sada&lt;/strong> for his exceptional guidance throughout this transformative learning experience. His insights have been invaluable in helping me develop the novel InfoAgent architecture and navigate the complexities of building production-ready AI systems.&lt;/p>
&lt;p>The OSRE 2025 program has been an incredible journey of growth and discovery. I&amp;rsquo;ve learned not just how to build AI systems, but how to make them trustworthy, explainable, and genuinely useful for real-world operations. The novel InfoAgent architecture we&amp;rsquo;ve developed serves the original mission: creating an intelligent observability tool that helps NRP operators solve problems faster and keep complex research systems running smoothly.&lt;/p>
&lt;p>I&amp;rsquo;m excited to continue contributing to this project and look forward to seeing how the community adopts and extends these ideas. Check out my contributions and ongoing updates at &lt;a href="https://mreddy10.pages.nrp-nautilus.io/gsocnrp/" target="_blank" rel="noopener">https://mreddy10.pages.nrp-nautilus.io/gsocnrp/&lt;/a>!&lt;/p></description></item><item><title>Midterm Update: Building Intelligent Observability for NRP</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250801-manish-reddy/</link><pubDate>Fri, 01 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250801-manish-reddy/</guid><description>&lt;p>I&amp;rsquo;m pleased to share the progress we&amp;rsquo;ve made on my OSRE 2025 project, &amp;ldquo;&lt;em>Intelligent Observability for Seam: A GenAI Approach&lt;/em>&amp;rdquo; since my initial announcement. We&amp;rsquo;re working toward our core goal: building an ML-powered service for NRP that analyzes monitoring data, detects anomalies, and provides trustworthy GenAI explanations.&lt;/p>
&lt;h2 id="how-our-agents-support-the-observability-mission">How Our Agents Support the Observability Mission&lt;/h2>
&lt;p>We&amp;rsquo;ve been developing specialized agents and tools that work together to support our original project vision:&lt;/p>
&lt;h3 id="1-prometheus-metrics-analysis-agent">1. Prometheus Metrics Analysis Agent&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Continuously ingests and processes NRP&amp;rsquo;s Prometheus metrics&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: We&amp;rsquo;ve implemented initial data pipelines for key system metrics&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Provides the foundation for anomaly detection by establishing normal behavior baselines&lt;/li>
&lt;/ul>
&lt;h3 id="2-query-refinement-agent-croq">2. Query Refinement Agent (CROQ)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Clarifies ambiguous metrics or patterns before generating explanations&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: We&amp;rsquo;ve implemented a basic version of Conformal Revision of Questions to resolve metric ambiguities&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Aims to ensure explanations address the right system behaviors (e.g., distinguishing CPU saturation from memory pressure)&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: We hope this will improve the accuracy of GenAI explanations by eliminating misinterpretations&lt;/li>
&lt;/ul>
&lt;h3 id="3-explanation-generation-agent-ais">3. Explanation Generation Agent (AIS)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Creates human-readable explanations and root-cause analysis&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: We&amp;rsquo;ve built a prototype of the Automated Information Seeker with a Plan→Validate→Execute→Assess→Revise cycle&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Transforms technical anomalies into actionable insights for operators&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: Intended to directly deliver on the GenAI explanation component of our tool&lt;/li>
&lt;/ul>
&lt;h2 id="integration-progress">Integration Progress&lt;/h2>
&lt;p>We&amp;rsquo;re working to connect our agents into a unified observability pipeline:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Collection&lt;/strong>: Prometheus metrics → Analysis Agent&lt;/li>
&lt;li>&lt;strong>Anomaly Detection&lt;/strong>: With statistical confidence bounds (in development)&lt;/li>
&lt;li>&lt;strong>Query Refinement&lt;/strong>: Resolving ambiguities before explanation&lt;/li>
&lt;li>&lt;strong>Explanation Generation&lt;/strong>: Human-readable analysis with uncertainty awareness&lt;/li>
&lt;li>&lt;strong>Feedback Loop&lt;/strong>: System learning from operator interactions (planned)&lt;/li>
&lt;/ol>
&lt;h2 id="hardware-testing-opportunity">Hardware Testing Opportunity&lt;/h2>
&lt;p>This project has given us a valuable opportunity to test our observability framework on Qualcomm Cloud AI 100 Ultra hardware. We&amp;rsquo;re beginning to port different LLM architectures specifically for:&lt;/p>
&lt;ul>
&lt;li>Exploring anomaly detection performance on specialized AI hardware&lt;/li>
&lt;li>Testing explanation generation quality across different model architectures&lt;/li>
&lt;li>Comparing GLM-4.5 against other models for observability-specific tasks&lt;/li>
&lt;/ul>
&lt;h2 id="next-phase-completing-the-observability-tool">Next Phase: Completing the Observability Tool&lt;/h2>
&lt;p>For the remainder of OSRE 2025, we&amp;rsquo;re focused on:&lt;/p>
&lt;ol>
&lt;li>Finalizing integration of all agents into a cohesive anomaly detection tool with Matrix integration&lt;/li>
&lt;li>Validating that our GenAI explanations help operators resolve user issues faster, which we plan to test on the Nautilus Matrix platform&lt;/li>
&lt;li>Optimizing performance on specialized hardware for NRP&amp;rsquo;s scale&lt;/li>
&lt;li>Preparing the open-source release of our intelligent observability tool&lt;/li>
&lt;/ol>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>I&amp;rsquo;m deeply grateful to my lead mentor &lt;strong>Mohammad Firas Sada&lt;/strong> for his guidance in keeping our work focused on NRP&amp;rsquo;s observability needs. His insights have been invaluable in navigating the challenges of this project.&lt;/p>
&lt;p>While we&amp;rsquo;ve developed several agents and frameworks, everything we&amp;rsquo;re building serves the original mission: creating an intelligent observability tool that helps NRP operators solve problems faster and keep complex research systems running smoothly.&lt;/p>
&lt;p>I look forward to sharing more progress on our observability tool with GenAI explanations in the coming weeks!&lt;/p></description></item><item><title>Kicking Off Intelligent Observability for Seam: My OSRE 2025 Journey</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250614-manish-reddy/</link><pubDate>Sat, 14 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250614-manish-reddy/</guid><description>&lt;p>Hi! I’m &lt;strong>Manish K Reddy&lt;/strong> (&lt;a href="https://github.com/kredd2506" target="_blank" rel="noopener">@kredd2506&lt;/a>), a graduate student based in the United States, and I’m excited to join the OSRE 2025 cohort. This summer, I’ll be working with the &lt;a href="https://www.sdsc.edu/" target="_blank" rel="noopener">San Diego Supercomputer Center (SDSC)&lt;/a> and the &lt;a href="https://nrp.ai/" target="_blank" rel="noopener">National Research Platform (NRP)&lt;/a> on a project that blends my interests in machine learning, cloud systems, and real-world impact.&lt;/p>
&lt;p>The &lt;a href="https://nrp.ai/" target="_blank" rel="noopener">National Research Platform (NRP)&lt;/a> has moved beyond its original vision as a “ScienceDMZ data freeway” and evolved into a distributed cloud supercomputer, empowering research and education across more than 50 institutions. SDSC, located at UC San Diego, is recognized internationally for driving innovation in data, supercomputing, and advanced cyberinfrastructure.&lt;/p>
&lt;p>&lt;strong>My project&lt;/strong>, &amp;ldquo;&lt;em>Intelligent Observability for Seam: A GenAI Approach&lt;/em>&amp;rdquo; focuses on building an ML-powered service for NRP. The goal is to analyze monitoring data (starting with Prometheus metrics), automatically detect anomalies, and use generative AI (GenAI) for human-readable explanations and root-cause analysis. This will help researchers and operators solve problems faster and keep complex research systems running smoothly.&lt;/p>
&lt;p>I am especially grateful to my lead mentor &lt;a href="https://ucsc-ospo.github.io/author/mohammad-firas-sada/" target="_blank" rel="noopener">Mohammad Firas Sada&lt;/a>, who is personally guiding me throughout this project. I also want to thank Jeffrey Weekley and Derek Weitzel for their support and guidance.&lt;br>
You can read my &lt;a href="https://summerofcode.withgoogle.com/media/user/e7a9ade92bcf/proposal/gAAAAABoTeP59B2JlNoLcurxCTBvCS0T9by5Tv8ce1Hs6PB629g9rgzeb_8UrJTZfgpdagnHs5NjUtyYlanFb99wPxpTWjWSgwwToS5qh5u_YUfp9p6IzyE=.pdf" target="_blank" rel="noopener">initial proposal here (PDF)&lt;/a>.&lt;/p>
&lt;hr>
&lt;h3 id="genai-driven-observability-for-nrp">GenAI-Driven Observability for NRP&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Machine Learning, Observability, DevOps, High Performance Computing, LLMs, GenAI, Distributed Systems&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Prometheus, Docker, Kubernetes, FastAPI, PyTorch, Pandas, LLM APIs, scikit-learn, PostgreSQL&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> Mohammad Firas Sada, Jeffrey Weekley, Derek Weitzel&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>This summer, I’m looking forward to:&lt;/p>
&lt;ul>
&lt;li>Delivering an open-source anomaly detection tool for NRP&lt;/li>
&lt;li>Building GenAI features for better explanations and root-cause analysis&lt;/li>
&lt;li>Learning from my mentors and contributing to a vibrant open science community&lt;/li>
&lt;/ul>
&lt;p>Thanks for reading, and I’m looking forward to sharing my journey and progress in the coming weeks!&lt;/p>
&lt;hr></description></item></channel></rss>