<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>data science | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/data-science/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/data-science/index.xml" rel="self" type="application/rss+xml"/><description>data science</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Fri, 30 Jan 2026 10:15:00 -0700</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>data science</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/data-science/</link></image><item><title>AI Data Readiness Inspector (AIDRIN)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/lbl/aidrin/</link><pubDate>Fri, 30 Jan 2026 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/lbl/aidrin/</guid><description>&lt;p>Garbage In, Garbage Out (GIGO) is a widely accepted quote in computer science across various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest considerable time and effort in preparing the data for AI.&lt;/p>
&lt;p>&lt;a href="https://arxiv.org/pdf/2406.19256" target="_blank" rel="noopener">AIDRIN&lt;/a> (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of data readiness for AI processes, covering a broad range of dimensions from the literature. AIDRIN uses metrics from traditional data quality assessment, such as completeness, outliers, and duplicates, to evaluate data. Furthermore, AIDRIN uses metrics specific to assessing AI data, such as feature importance, feature correlations, class imbalance, fairness, privacy, and compliance with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles. AIDRIN provides visualizations and reports to assist data scientists in further investigating data readiness.&lt;/p>
&lt;h3 id="aidrin-multiple-file-formats">AIDRIN Multiple File Formats&lt;/h3>
&lt;p>The proposed work will include improvements in the AIDRIN framework to (1) add support for new file formats such as Zarr, ROOT, and HDF5, and (2) allow users to provide custom data ingestion mechanisms.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>data readiness&lt;/code>, &lt;code>AI&lt;/code>, &lt;code>data analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, data analysis, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Drishti</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/lbl/drishti/</link><pubDate>Fri, 30 Jan 2026 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/lbl/drishti/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> is a novel interactive web-based analysis framework to visualize I/O traces, highlight bottlenecks, and help understand the I/O behavior of scientific applications. Drishti aims to fill the gap between the trace collection, analysis, and tuning phases. The framework contains an interactive I/O trace analysis component for end-users to visually inspect their applications&amp;rsquo; I/O behavior, focusing on areas of interest and getting a clear picture of common root causes of I/O performance bottlenecks. Based on the automatic detection of I/O performance bottlenecks, the framework maps numerous common, well-known bottlenecks to solution recommendations that users can implement.&lt;/p>
&lt;h3 id="drishti-comparisons-and-heatmaps">Drishti Comparisons and Heatmaps&lt;/h3>
&lt;p>The proposed work will include investigating and building a solution to allow comparing and finding differences between two I/O trace files (similar to a &lt;code>diff&lt;/code>), covering the analysis and visualization components. It will also explore additional metrics and counters such as Darshan heatmaps in the analysis and visualization components of the framework.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code>, &lt;code>HPC&lt;/code>, &lt;code>data analysis&lt;/code>, &lt;code>visualization&lt;/code>, &lt;code>profiling&lt;/code>, &lt;code>tracing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, data analysis, performance profiling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Lynx Grader</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/autograder/</link><pubDate>Tue, 13 Jan 2026 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/autograder/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">EduLinq Lynx Grader&lt;/a> (also referred to as &amp;ldquo;autograder&amp;rdquo;) is an open source tool used by several courses at UCSC
to safely and quickly grade programming assignments.
Grading student code is something that may seem simple at first (you just need to run their code!),
but quickly becomes exceedingly complex as you get more into the details.
Specifically, grading a student&amp;rsquo;s code securely while providing the &amp;ldquo;last mile&amp;rdquo; service of getting code from students
and sending results to instructors/TAs and the course&amp;rsquo;s LMS (e.g., Canvas) can be very difficult.
The Lynx Grader provides all of this in a free and open source project.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to maintain and improve the Lynx Grader.&lt;/p>
&lt;p>As an open source project, there are endless opportunities for development, improvements, and collaboration.
Here, we highlight some specific projects that will work well in the summer mentorship setting.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2026 should fill out &lt;a href="https://forms.gle/Mr4YR3N35pWDb4uz7" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., commenting on an issue to say that you want to tackle it as a part of OSRE/GSoC 2026).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="llm-detection">LLM Detection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>AI/ML&lt;/code> &lt;code>LLM&lt;/code> &lt;code>Research&lt;/code> &lt;code>Backend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, systems, data munging, go, docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>As &lt;a href="https://en.wikipedia.org/wiki/Large_language_model" target="_blank" rel="noopener">Large Language Model (LLM)&lt;/a> tools like ChatGPT become more common and powerful,
instructors need tools to help determine if students are the actual authors of the code they submit.
More classical instances of plagiarism are often discovered by code similarity tools like &lt;a href="https://theory.stanford.edu/~aiken/moss/" target="_blank" rel="noopener">MOSS&lt;/a>.
However, these tools are not sufficient for detecting code written not by a student,
but by an AI model like &lt;a href="https://en.wikipedia.org/wiki/ChatGPT" target="_blank" rel="noopener">ChatGPT&lt;/a> or &lt;a href="https://en.wikipedia.org/wiki/GitHub_Copilot" target="_blank" rel="noopener">GitHub Copilot&lt;/a>.&lt;/p>
&lt;p>The task for this project is to create a system that provides a score indicating the system&amp;rsquo;s confidence that a given piece of code was written by an AI tool and not a student.
This will supplement the existing code analysis tools in the Lynx Grader.
There are many approaches to completing this task that will be considered.
A more software-development-oriented approach can consist of leveraging existing systems to create a production-ready system,
whereas a more research-oriented approach can consist of developing a novel method complete with a paper and experiments.&lt;/p>
&lt;p>There has been &lt;a href="https://github.com/anvichip/AI-code-detection-ML/blob/main/experiment/report.md" target="_blank" rel="noopener">previous work on this issue&lt;/a>,
where a student surveyed existing solutions, collected initial datasets, and ran exploratory experiments on possible directions.
This project would build on this previous work.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">Repository for Lynx Grader Server&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/140" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="code-analysis-gui">Code Analysis GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, data munging, js, css, go&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Lynx Grader has existing functionality to analyze the code in a student&amp;rsquo;s submission for malicious content.
Relevant to this project is that the Lynx Grader can run a pairwise similarity analysis against all submitted code.
This is how most existing software plagiarism systems detect offending code.
The existing infrastructure provides detailed statistics on code similarity,
but does not currently have a visual way to display this data.&lt;/p>
&lt;p>The task for this project is to create a web GUI using the Lynx Grader REST API
to display the results of a code analysis.
The size of this project depends on how many of the existing features are going to be supported by the web GUI.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Lynx Grader Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/142" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/internal/model/analysis.go#L78" target="_blank" rel="noopener">Pairwise Code Analysis Type&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py/blob/v0.6.16/tests/api/testdata/courses/assignments/analysis/courses_assignments_submissions_analysis_pairwise_wait.json" target="_blank" rel="noopener">Sample API Data&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="web-gui">Web GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, js, css&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Lynx Grader contains dozens of &lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">API endpoints&lt;/a>,
most directly representing a piece of functionality exposed to the user.
All of these features are exposed in the &lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Lynx Grader&amp;rsquo;s Python Interface&lt;/a>.
However, the Python interface is a purely command-line interface.
And although command-line interfaces are objectively (read: subjectively) the best,
a web GUI would be more accessible to a wider audience.
The autograder already has a &lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">web GUI&lt;/a>,
but it does not cover all the features available in the Lynx Grader.&lt;/p>
&lt;p>The task for this project is to augment the Lynx Grader&amp;rsquo;s web GUI with more features.
Specifically, add support for more tools used to create and administer courses.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Lynx Grader Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/61" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">Lynx Grader API Endpoints&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Lynx Grader&amp;rsquo;s Python Interface&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>AI Data Readiness Inspector (AIDRIN)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/aidrin/</link><pubDate>Tue, 11 Feb 2025 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/aidrin/</guid><description>&lt;p>Garbage In, Garbage Out (GIGO) is a widely accepted maxim among computer scientists in various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest considerable time and effort in preparing the data for AI.&lt;/p>
&lt;p>&lt;a href="https://arxiv.org/pdf/2406.19256" target="_blank" rel="noopener">AIDRIN&lt;/a> (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of the readiness of data for AI processes, covering a broad range of readiness dimensions available in the literature. AIDRIN uses metrics from traditional data quality assessment, such as completeness, outliers, and duplicates, for data evaluation. Furthermore, AIDRIN uses metrics specific to assessing data for AI, such as feature importance, feature correlations, class imbalance, fairness, privacy, and compliance with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles. AIDRIN provides visualizations and reports to assist data scientists in further investigating the readiness of data.&lt;/p>
&lt;h3 id="aidrin-visualizations-and-science-gateway">AIDRIN Visualizations and Science Gateway&lt;/h3>
&lt;p>The proposed work will include improvements in the AIDRIN framework to (1) enhance, extend, and optimize the visualizations of metrics related to all six pillars of AI data readiness and (2) set up a science gateway on NERSC or an AWS cloud service.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>data readiness&lt;/code> &lt;code>AI&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>h5bench with AI workloads</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/h5bench-ai/</link><pubDate>Tue, 11 Feb 2025 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/h5bench-ai/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high-performance computing systems. h5bench measures I/O performance from various aspects, including I/O overhead and observed I/O rate.&lt;/p>
&lt;p>Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performance parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library&amp;rsquo;s heavy usage in various scientific applications running on supercomputing systems. The tests in the h5bench suite cover I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), and I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and to evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.&lt;/p>
&lt;h3 id="h5bench-with-ai-workloads">h5bench with AI workloads&lt;/h3>
&lt;p>The proposed work will include (1) analyzing and characterizing AI workloads that rely on HDF5 datasets, (2) extracting a kernel of their I/O operations, and (3) implementing and validating the kernel in h5bench.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Autograder</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/autograder/</link><pubDate>Thu, 06 Feb 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/autograder/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">EduLinq Autograder&lt;/a> is an open source tool used by several courses at UCSC
to safely and quickly grade programming assignments.
Grading student code is something that may seem simple at first (you just need to run their code!),
but quickly becomes exceedingly complex as you get more into the details.
Specifically, grading a student&amp;rsquo;s code securely while providing the &amp;ldquo;last mile&amp;rdquo; service of getting code from students
and sending results to instructors/TAs and the course&amp;rsquo;s LMS (e.g., Canvas) can be very difficult.
The Autograder provides all of this in a free and open source project.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to maintain and improve the Autograder.&lt;/p>
&lt;p>As an open source project, there are endless opportunities for development, improvements, and collaboration.
Here, we highlight some specific projects that will work well in the summer mentorship setting.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2025 should fill out &lt;a href="https://forms.gle/RxGqnQiCDeHSX6tq6" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., commenting on an issue to say that you want to tackle it as a part of OSRE/GSoC 2025).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="llm-detection">LLM Detection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>AI/ML&lt;/code> &lt;code>LLM&lt;/code> &lt;code>Research&lt;/code> &lt;code>Backend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, systems, data munging, go, docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>As &lt;a href="https://en.wikipedia.org/wiki/Large_language_model" target="_blank" rel="noopener">Large Language Model (LLM)&lt;/a> tools like ChatGPT become more common and powerful,
instructors need tools to help determine if students are the actual authors of the code they submit.
More classical instances of plagiarism are often discovered by code similarity tools like &lt;a href="https://theory.stanford.edu/~aiken/moss/" target="_blank" rel="noopener">MOSS&lt;/a>.
However, these tools are not sufficient for detecting code written not by a student,
but by an AI model like &lt;a href="https://en.wikipedia.org/wiki/ChatGPT" target="_blank" rel="noopener">ChatGPT&lt;/a> or &lt;a href="https://en.wikipedia.org/wiki/GitHub_Copilot" target="_blank" rel="noopener">GitHub Copilot&lt;/a>.&lt;/p>
&lt;p>The task for this project is to create a system that provides a score indicating the system&amp;rsquo;s confidence that a given piece of code was written by an AI tool and not a student.
This will supplement the existing code analysis tools in the Autograder.
There are many approaches to completing this task that will be considered.
A more software-development-oriented approach can consist of leveraging existing systems to create a production-ready system,
whereas a more research-oriented approach can consist of developing a novel method complete with a paper and experiments.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">Repository for Autograder Server&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/140" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="code-analysis-gui">Code Analysis GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, data munging, js, css, go&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Autograder has existing functionality to analyze the code in a student&amp;rsquo;s submission for malicious content.
Relevant to this project is that the Autograder can run a pairwise similarity analysis against all submitted code.
This is how most existing software plagiarism systems detect offending code.
The existing infrastructure provides detailed statistics on code similarity,
but does not currently have a visual way to display this data.&lt;/p>
&lt;p>The task for this project is to create a web GUI using the Autograder REST API
to display the results of a code analysis.
The size of this project depends on how many of the existing features are going to be supported by the web GUI.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Autograder Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/142" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/internal/model/analysis.go#L78" target="_blank" rel="noopener">Pairwise Code Analysis Type&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py/blob/main/tests/api/testdata/courses/assignments/submit/analysis/course_assignments_submissions_analysis_pairwise_wait.json" target="_blank" rel="noopener">Sample API Data&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="web-gui">Web GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, js, css&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Autograder contains dozens of &lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">API endpoints&lt;/a>,
most directly representing a piece of functionality exposed to the user.
All of these features are exposed in the &lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Autograder&amp;rsquo;s Python Interface&lt;/a>.
However, the Python interface is a purely command-line interface.
And although command-line interfaces are objectively (read: subjectively) the best,
a web GUI would be more accessible to a wider audience.
The autograder already has a web GUI,
but it does not cover all the features available in the Autograder.&lt;/p>
&lt;p>The task for this project is to augment the Autograder&amp;rsquo;s web GUI with more features.
Specifically, add support for more tools used to create and administer courses.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Autograder Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/61" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">Autograder API Endpoints&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Autograder&amp;rsquo;s Python Interface&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Mediglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/polyphy/</link><pubDate>Tue, 04 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/polyphy/</guid><description>&lt;p>&lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">PolyPhy&lt;/a> is a GPU-oriented agent-based system for reconstructing and visualizing &lt;em>optimal transport networks&lt;/em> defined over sparse data. PolyPhy is rooted in astronomy and inspired by nature: we have used an early prototype called &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">Polyphorm&lt;/a> to reconstruct the &lt;a href="https://youtu.be/5ILwq5OFuwY" target="_blank" rel="noopener">Cosmic web&lt;/a> structure, and also to discover network-like patterns in natural language data. You can see an instructive overview of PolyPhy in our &lt;a href="https://elek.pub/workshop_cross2022.html" target="_blank" rel="noopener">workshop&lt;/a> and more details about our research &lt;a href="https://elek.pub/projects/Rhizome-Cosmology" target="_blank" rel="noopener">here&lt;/a>. Recent projects, such as &lt;a href="https://github.com/PolyPhyHub/PolyGlot" target="_blank" rel="noopener">Polyglot&lt;/a> and &lt;a href="https://github.com/Ayush-Sharma410/MediGlot" target="_blank" rel="noopener">Mediglot&lt;/a>, have focused on using PolyPhy to better visualize language embeddings.&lt;/p>
&lt;h3 id="medicinal-language-embeddings">Medicinal Language Embeddings&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Large Language Models&lt;/code> &lt;code>NLP&lt;/code> &lt;code>Embeddings&lt;/code> &lt;code>Medicine&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, JavaScript, Data Science, Technical Communication&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:kdeol@ualberta.ca">Kiran Deol&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project aims to refine and enhance Mediglot, a web application for visualizing 3D medicinal embeddings, which extends the Polyglot app and leverages the PolyPhy toolkit for network-inspired data science. Mediglot currently enables users to explore high-dimensional vector representations of medicines (derived from their salt compositions) in a 3D space using UMAP, as well as analyze similarity through the innovative Monte-Carlo Physarum Machine (MCPM) metric. Unlike traditional language data, medicinal embeddings do not have an inherent sequential structure. Instead, we must work with the salt compositions of each medicine to create embeddings that are faithful to the intended purpose of each medicine.&lt;/p>
&lt;p>This year, we would like to focus on exploring and integrating state-of-the-art AI techniques and algorithms to improve Mediglot&amp;rsquo;s clustering capabilities and its representation of medicinal data in 3D. The contributor will experiment with advanced large language models (LLMs) and cutting-edge AI methods to develop innovative approaches for refining clustering and extracting deeper insights from medicinal embeddings. Beyond LLMs, we would like to experiment with more traditional language processing methods to design novel embedding procedures. Additionally, we would like to experiment with other similarity metrics. While the similarity of two medicines depends on the initial embedding, we would like to examine the effects of different metrics on the kinds of insights a user can extract. Finally, the contributor is expected to evaluate and compare different algorithms for dimensionality reduction to enhance the faithfulness of the visualization and its interpretability.&lt;/p>
&lt;p>The ideal contributor for this project has experience with Python and common scientific toolkits such as NumPy, pandas, and SciPy. They will also need some experience with JavaScript and web development (Mediglot is distributed as a vanilla JS web app). Knowledge of embedding techniques for language processing is highly recommended.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Work closely with the mentors to understand the context of the project and its detailed requirements in preparation for the proposal.&lt;/li>
&lt;li>Become acquainted with the tooling (PolyPhy, PolyGlot, Mediglot) prior to the start of the project period.&lt;/li>
&lt;li>Explore different embedding techniques for medicinal data (including implementing novel embedding procedures).&lt;/li>
&lt;li>Explore different dimensionality reduction techniques, with a focus on faithful visualizations.&lt;/li>
&lt;li>Document the process and resulting findings in a publicly available report.&lt;/li>
&lt;/ul>
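&lt;p>The question of how the choice of similarity metric changes what a user sees can be illustrated with a toy sketch. It is not Mediglot code and does not implement the MCPM metric; the embeddings, the names &lt;code>toy&lt;/code> and &lt;code>rank_neighbors&lt;/code>, and the two-dimensional vectors are all assumptions made for illustration. It only shows that cosine similarity and Euclidean distance can rank the same neighbors differently.&lt;/p>

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Higher means more similar; depends only on direction, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def negative_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    # Negated distance so that, as with cosine, higher means more similar.
    return -math.dist(a, b)


def rank_neighbors(
    query: str,
    embeddings: Dict[str, Sequence[float]],
    score: Callable[[Sequence[float], Sequence[float]], float],
) -> List[str]:
    # Order every other item from most to least similar to the query.
    q = embeddings[query]
    others = [name for name in embeddings if name != query]
    return sorted(others, key=lambda name: score(q, embeddings[name]), reverse=True)


# Hypothetical 2D embeddings: B points the same way as A but is far away,
# while C is close to A but points in a different direction.
toy = {"A": [1.0, 0.0], "B": [10.0, 0.0], "C": [1.0, 1.0]}

by_cosine = rank_neighbors("A", toy, cosine_similarity)
by_euclidean = rank_neighbors("A", toy, negative_euclidean)
```

&lt;p>With these toy vectors, cosine similarity ranks B ahead of C, while Euclidean distance ranks C ahead of B, so the "most similar medicine" shown to a user would differ between the two metrics.&lt;/p>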
&lt;h3 id="enhancing-polyphy-web-application">Enhancing PolyPhy Web Application&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>UI/UX Design&lt;/code> &lt;code>Full Stack Development&lt;/code> &lt;code>JavaScript&lt;/code> &lt;code>Next.js&lt;/code> &lt;code>Node.js&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Full Stack Web Development, UI/UX Design, JavaScript, Next.js, Node.js, Technical Communication&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:kdeol@ualberta.ca">Kiran Deol&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project aims to revamp and enhance the PolyPhy web platform to better support contributors, users, and researchers. The goal is to optimize the website’s UI/UX, improve its performance, and integrate Mediglot to provide users with a seamless experience in visualizing both general network structures and 3D medicinal embeddings.&lt;/p>
&lt;p>The contributor will be responsible for improving the website’s overall look, feel, and functionality, ensuring a smooth and engaging experience for both contributors and end-users. This includes addressing front-end and back-end challenges, optimizing the platform for better accessibility, and ensuring seamless integration with Mediglot.&lt;/p>
&lt;p>The ideal candidate should have experience in full-stack web development, particularly with &lt;strong>Next.js&lt;/strong>, &lt;strong>JavaScript&lt;/strong>, and &lt;strong>Node.js&lt;/strong>, and should be familiar with UI/UX design principles. A strong ability to communicate effectively, both in writing and through code, is essential for this role.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Collaborate with mentors&lt;/strong> to understand the project&amp;rsquo;s goals and the specific requirements for the website improvements.&lt;/li>
&lt;li>&lt;strong>UI/UX Redesign&lt;/strong>:
&lt;ul>
&lt;li>Redesign and enhance the website’s navigation, layout, and visual elements to create an intuitive and visually engaging experience.&lt;/li>
&lt;li>Improve mobile responsiveness for broader accessibility across devices.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Website Performance &amp;amp; Stability&lt;/strong>:
&lt;ul>
&lt;li>Identify and resolve performance bottlenecks, bugs, or issues affecting speed, stability, and usability.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Mediglot Integration&lt;/strong>:
&lt;ul>
&lt;li>Integrate the Mediglot web application with PolyPhy, ensuring seamless functionality and a unified user experience for visualizing medicinal data alongside general network reconstructions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation&lt;/strong>:
&lt;ul>
&lt;li>Document the development process, challenges, and solutions in a clear and organized manner, ensuring transparent collaboration with mentors and the community.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol></description></item><item><title>Midway Through GSoC</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/drishti/20240714-jaytau/</link><pubDate>Wed, 31 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/drishti/20240714-jaytau/</guid><description>&lt;p>Hello everyone! I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joel-tony/">Joel Tony&lt;/a>, and I&amp;rsquo;m excited to share my progress update on the &lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> project as part of my Google Summer of Code (GSoC) experience. Over the past few weeks, I&amp;rsquo;ve been diving deep into the world of I/O visualization for scientific applications, and I&amp;rsquo;m thrilled to tell you about the strides we&amp;rsquo;ve made.&lt;/p>
&lt;h2 id="what-is-drishti">What is Drishti?&lt;/h2>
&lt;p>For those unfamiliar with Drishti, it&amp;rsquo;s an application used to visualize I/O traces of scientific applications. When running complex scientific applications, understanding their I/O behavior can be challenging. Drishti steps in to parse logs from various sources, with a primary focus on those collected using &lt;a href="https://wordpress.cels.anl.gov/darshan/" target="_blank" rel="noopener">Darshan&lt;/a>, a lightweight I/O characterization tool for HPC applications. Drishti provides human-interpretable insights on how to improve I/O performance based on these logs. While Drishti supports multiple log sources, our current work emphasizes Darshan logs due to their comprehensive I/O information. Additionally, Drishti offers visually appealing and easy-to-understand graphs to help users better grasp their application&amp;rsquo;s I/O patterns, making it easier to identify bottlenecks and optimize performance.&lt;/p>
&lt;h2 id="progress-and-challenges">Progress and Challenges&lt;/h2>
&lt;h3 id="export-directory-feature">Export Directory Feature&lt;/h3>
&lt;p>One of the first features I implemented was the export directory functionality. In earlier versions of Drishti, users couldn&amp;rsquo;t select where they wanted their output files to be saved. This became problematic when working with read-only log locations. I familiarized myself with the codebase, created a pull request, and successfully added this feature, allowing users to choose their preferred output location.&lt;/p>
&lt;h3 id="ci-improvements-and-cross-project-dependencies">CI Improvements and Cross-Project Dependencies&lt;/h3>
&lt;p>While working on Drishti, I discovered the tight coupling between various tools in the HPC I/O organization, such as Drishti and DXT Explorer. This highlighted the need for improved Continuous Integration (CI) practices. We currently run about eight GitHub Actions for each pull request, but they don&amp;rsquo;t adequately test the interactions between different branches of these interconnected tools. This is an area we&amp;rsquo;ve identified for future improvement to ensure smoother integration and fewer conflicts between projects.&lt;/p>
&lt;h3 id="refactoring-for-multi-file-support">Refactoring for Multi-File Support&lt;/h3>
&lt;p>The bulk of my time was spent refactoring Drishti to extend its framework from parsing single Darshan files to handling multiple files. This task was more complex than it initially appeared, as Drishti&amp;rsquo;s insights are based on the contents of each Darshan file. When dealing with multiple files, we needed to find a way to aggregate the data meaningfully without sacrificing performance.&lt;/p>
&lt;p>The original codebase had a single, thousand-line function for parsing Darshan files. To improve this, I implemented a data class structure in Python. This refactoring allows for:&lt;/p>
&lt;ol>
&lt;li>Better separation of computation and condition checking&lt;/li>
&lt;li>Easier parallelization of processing multiple traces&lt;/li>
&lt;li>Finer-grained profiling of performance bottlenecks&lt;/li>
&lt;li>More flexibility in data manipulation and memory management&lt;/li>
&lt;/ol>
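&lt;p>A minimal sketch of this kind of data class structure is shown below. It is not Drishti&amp;rsquo;s actual code: the class name &lt;code>TraceSummary&lt;/code>, its fields, the helper functions, and the 10% threshold are all hypothetical, chosen only to illustrate separating computation from condition checking and aggregating several traces.&lt;/p>

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceSummary:
    """Hypothetical per-trace counters; field names are illustrative only."""
    path: str
    bytes_read: int = 0
    bytes_written: int = 0
    small_reads: int = 0
    total_reads: int = 0

    @property
    def small_read_ratio(self) -> float:
        # Computation only: no thresholds or recommendations live here.
        return self.small_reads / self.total_reads if self.total_reads else 0.0


def has_many_small_reads(summary: TraceSummary, threshold: float = 0.1) -> bool:
    # Condition check kept separate from the computation above.
    # The default threshold is an arbitrary placeholder.
    return summary.small_read_ratio > threshold


def aggregate(summaries: List[TraceSummary]) -> TraceSummary:
    # Sum the raw counters across traces; ratios are then recomputed from
    # the sums rather than averaged per file.
    return TraceSummary(
        path="(aggregate)",
        bytes_read=sum(s.bytes_read for s in summaries),
        bytes_written=sum(s.bytes_written for s in summaries),
        small_reads=sum(s.small_reads for s in summaries),
        total_reads=sum(s.total_reads for s in summaries),
    )
```

&lt;p>Because each &lt;code>TraceSummary&lt;/code> is built independently of the others, the per-file work could be distributed across processes and combined afterwards with &lt;code>aggregate&lt;/code>; whether that matches Drishti&amp;rsquo;s final design is left open here.&lt;/p>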
&lt;h2 id="learnings-and-skills-gained">Learnings and Skills Gained&lt;/h2>
&lt;p>Through this process, I&amp;rsquo;ve gained valuable insights into:&lt;/p>
&lt;ol>
&lt;li>Refactoring large codebases&lt;/li>
&lt;li>Understanding and improving cross-project dependencies&lt;/li>
&lt;li>Implementing data classes in Python for better code organization&lt;/li>
&lt;li>Balancing performance with code readability and maintainability&lt;/li>
&lt;/ol>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>As I move forward with the project, my focus will be on:&lt;/p>
&lt;ol>
&lt;li>Adding unit tests for individual methods to ensure functionality&lt;/li>
&lt;li>Exploring alternative data frame implementations like Polars for better performance&lt;/li>
&lt;li>Developing aggregation methods for different types of data across multiple Darshan files&lt;/li>
&lt;li>Optimizing memory usage and computational efficiency for large datasets&lt;/li>
&lt;/ol>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Working on Drishti has been an incredible learning experience. I&amp;rsquo;ve had the opportunity to tackle real-world challenges in scientific computing and I/O visualization. As we progress, I&amp;rsquo;m excited about the potential impact of these improvements on the scientific community&amp;rsquo;s ability to optimize their applications&amp;rsquo; I/O performance.&lt;/p>
&lt;p>I&amp;rsquo;m grateful for this opportunity and looking forward to the challenges and discoveries that lie ahead in the second half of my GSoC journey. Stay tuned for more updates as we continue to enhance Drishti!&lt;/p>
&lt;p>If you have any questions or would like to learn more about the project, feel free to &lt;a href="https://www.jaytau.com/#contact?ref=uc-ospo" target="_blank" rel="noopener">reach out to me&lt;/a>. Let&amp;rsquo;s keep pushing the boundaries of scientific computing together!&lt;/p></description></item><item><title>Drishti</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/drishti/20240614-jaytau/</link><pubDate>Thu, 06 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/drishti/20240614-jaytau/</guid><description>&lt;p>Namaste everyone! 🙏🏻&lt;/p>
&lt;p>I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joel-tony/">Joel Tony&lt;/a>, a third-year Computer Science undergraduate at BITS Pilani, Goa, India. I&amp;rsquo;m truly honored to be part of this year&amp;rsquo;s Google Summer of Code program, working with the UC OSPO organization on a project that genuinely excites me. I&amp;rsquo;m particularly grateful to be working under the mentorship of Dr. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a>, a Research Scientist at Lawrence Berkeley National Laboratory, and Dr. &lt;a href="https://sbyna.github.io" target="_blank" rel="noopener">Suren Byna&lt;/a>, a Full Professor at the Ohio State University. Their expertise in high-performance computing and data systems is invaluable as I tackle this project.&lt;/p>
&lt;p>My project, &amp;ldquo;&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/drishti">Drishti: Visualization and Analysis of AI-based Applications&lt;/a>&amp;rdquo;, aims to extend the &lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> framework to better support AI/ML workloads, focusing specifically on optimizing their Input/Output (I/O) performance. I/O refers to the data transfer between a computer&amp;rsquo;s memory and external storage devices like hard drives (HDDs) or solid-state drives (SSDs). As AI models and datasets continue to grow exponentially in size, efficient I/O management has become a critical bottleneck that can significantly impact the overall performance of these data-intensive workloads.&lt;/p>
&lt;p>Drishti is an innovative, interactive web-based framework that helps users understand the I/O behavior of scientific applications by visualizing I/O traces and highlighting bottlenecks. It transforms raw I/O data into interpretable visualizations, making performance issues more apparent. Now, I&amp;rsquo;m working to adapt these capabilities for the unique I/O patterns of AI/ML workloads.&lt;/p>
&lt;p>Through my studies in high-performance computing and working with tools like BeeGFS and Darshan, I&amp;rsquo;ve gained insights into the intricacies of I/O performance. However, adapting Drishti for AI/ML workloads presents new challenges. In traditional HPC, computation often dominates, but in the realm of AI, the tables have turned. As models grow by billions of parameters and datasets expand to petabytes, I/O has become the critical path. Training larger models or using richer datasets doesn&amp;rsquo;t just mean more computation; it means handling vastly more data. This shift makes I/O optimization not just a performance tweak but a fundamental enabler of AI progress. By fine-tuning Drishti for AI/ML workloads, we aim to pinpoint I/O bottlenecks precisely, helping researchers streamline their data pipelines and unlock the full potential of their hardware.&lt;/p>
&lt;p>As outlined in my &lt;a href="https://docs.google.com/document/d/1zfQclXYWFswUbHuuwEU7bjjTvzS3gRCyNci08lTR3Rg/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, my tasks are threefold:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Modularize Drishti&amp;rsquo;s codebase&lt;/strong>: Currently, it&amp;rsquo;s a single 1700-line file that handles multiple functionalities. I&amp;rsquo;ll be refactoring it into focused, maintainable modules, improving readability and facilitating future enhancements.&lt;/li>
&lt;li>&lt;strong>Enable multi-trace handling&lt;/strong>: Unlike traditional HPC apps that typically generate one trace file, most AI jobs produce multiple. I&amp;rsquo;ll build a layer to aggregate these, providing a comprehensive view of the application&amp;rsquo;s I/O behavior.&lt;/li>
&lt;li>&lt;strong>Craft AI/ML-specific recommendations&lt;/strong>: Current suggestions often involve MPI-IO or HDF5, which aren&amp;rsquo;t typical in ML frameworks like PyTorch or TensorFlow. I&amp;rsquo;ll create targeted recommendations that align with these frameworks&amp;rsquo; data pipelines.&lt;/li>
&lt;/ol>
&lt;p>This summer, my mission is to make Drishti as fluent in AI/ML I/O patterns as it is in traditional HPC workloads. My goal is not just to adapt Drishti but to optimize it for the unique I/O challenges that AI/ML applications face. Whether it&amp;rsquo;s dealing with massive datasets, handling numerous small files, or navigating framework-specific data formats, we want Drishti to provide clear, actionable insights.&lt;/p>
&lt;p>From classroom theories to hands-on projects, from understanding file systems to optimizing AI workflows, each step has deepened my appreciation for the complexities and potential of high-performance computing. This GSoC project is an opportunity to apply this knowledge in a meaningful way, contributing to a tool that can significantly impact the open-source community.&lt;/p>
&lt;p>In today&amp;rsquo;s AI-driven world, the pace of innovation is often gated by I/O performance. A model that takes weeks to train due to I/O bottlenecks might, with optimized I/O, train in days—translating directly into faster iterations, more experiments, and ultimately, breakthroughs. By making I/O behavior in AI/ML applications more interpretable through Drishti, we&amp;rsquo;re not just tweaking code. We&amp;rsquo;re providing developers with the insights they need to optimize their data pipelines, turning I/O from a bottleneck into a catalyst for AI advancement.&lt;/p>
&lt;p>I look forward to sharing updates as we adapt Drishti for the AI era, focusing squarely on optimizing I/O for AI/ML workloads. In doing so, we aim to accelerate not just data transfer but the very progress of AI itself. I&amp;rsquo;m deeply thankful to Dr. Jean Luca Bez and Prof. Suren Byna for their guidance in this endeavor and to the UC OSPO and GSoC communities for this incredible opportunity.&lt;/p></description></item><item><title>Drishti</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/drishti/</link><pubDate>Tue, 30 Jan 2024 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/drishti/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> is a novel interactive web-based analysis framework to visualize I/O traces, highlight bottlenecks, and help understand the I/O behavior of scientific applications. Drishti aims to fill the gap between the trace collection, analysis, and tuning phases. The framework contains an interactive I/O trace analysis component for end-users to visually inspect their applications&amp;rsquo; I/O behavior, focusing on areas of interest and getting a clear picture of common root causes of I/O performance bottlenecks. Based on the automatic detection of I/O performance bottlenecks, our framework maps numerous common and well-known bottlenecks and their solution recommendations that can be implemented by users.&lt;/p>
&lt;h3 id="drishti--server-side-visualization-service">Drishti / Server-side Visualization Service&lt;/h3>
&lt;p>The proposed work will include investigating and building server-side solutions to support the visualization of larger I/O traces and logs, while integrating with the existing analysis, reports, and recommendations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>visualization&lt;/code> &lt;code>performance analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, HTML/CSS, JavaScript&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="drishti--visualization-and-analysis-of-ai-based-applications">Drishti / Visualization and Analysis of AI-based Applications&lt;/h3>
&lt;p>This project will extend Drishti to handle metrics from non-MPI applications, specifically AI/ML codes and applications. This work entails adapting the existing framework, heuristics, and recommendations to support metrics collected from AI/ML workloads.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>AI&lt;/code> &lt;code>visualization&lt;/code> &lt;code>performance analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, AI, performance profiling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>h5bench</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench/</link><pubDate>Tue, 30 Jan 2024 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high-performance computing systems. h5bench measures several aspects of I/O performance, including I/O overhead and observed I/O rate.&lt;/p>
&lt;p>Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performance parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. To provide representative I/O kernels drawn from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems along numerous dimensions. Our focus on HDF5 is due to the parallel I/O library&amp;rsquo;s heavy usage in various scientific applications running on supercomputing systems. The tests in the h5bench suite cover I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), and I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and to evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.&lt;/p>
&lt;h3 id="h5bench--reporting-and-enhancing">h5bench / Reporting and Enhancing&lt;/h3>
&lt;p>The proposed work will include standardizing and enhancing the reports generated by the suite and integrating additional I/O kernels (e.g., HACC-IO).&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="h5bench--compression">h5bench / Compression&lt;/h3>
&lt;p>The proposed work will focus on incorporating compression capabilities into the h5bench core access patterns through HDF5 filters.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code> &lt;code>compression&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, HDF5&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>PolyPhy</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/polyphy/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/polyphy/</guid><description>&lt;p>&lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">PolyPhy&lt;/a> is a GPU-oriented agent-based system for reconstructing and visualizing &lt;em>optimal transport networks&lt;/em> defined over sparse data. The project is rooted in astronomy and inspired by nature: we have used an early prototype called &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">Polyphorm&lt;/a> to reconstruct the &lt;a href="https://youtu.be/5ILwq5OFuwY" target="_blank" rel="noopener">Cosmic web&lt;/a> structure, and also to discover network-like patterns in natural language data. You can see an instructive overview of PolyPhy in our &lt;a href="https://elek.pub/workshop_cross2022.html" target="_blank" rel="noopener">workshop&lt;/a> and more details about our research &lt;a href="https://elek.pub/projects/Rhizome-Cosmology" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Under the hood, PolyPhy uses a richer 3D scalar field representation of the reconstructed network, instead of a typical discrete representation like a graph or a mesh. The ultimate purpose of PolyPhy is to become a toolkit for a range of specialists across different disciplines: astronomers, neuroscientists, data scientists and even artists and designers. PolyPhy aspires to be a tool for discovering connections between different disciplines by creating quantitatively comparable structural analytics.&lt;/p>
&lt;h3 id="polyphy-web-presence">PolyPhy Web Presence&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>UX&lt;/code> &lt;code>Social Media&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> full stack web development, JavaScript, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:ez@nmsu.edu">Ezra Huscher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The online presentation of a software project is without a doubt one of the core ingredients of its success. This project aims to develop a sustainable web presence for PolyPhy, catering to interested contributors, active collaborators, and users alike.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work closely with the mentors to understand the context of the project and its detailed requirements in preparation for the proposal.&lt;/li>
&lt;li>Port the existing &lt;a href="https://polyphy.io" target="_blank" rel="noopener">website&lt;/a> into a more modern JavaScript framework (such as Next.js) that provides a user-friendly CMS and admin interface.&lt;/li>
&lt;li>Update the contents of the website with new information from the &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">repository page&lt;/a> as well as other sources as directed by the mentors.&lt;/li>
&lt;li>Develop a simple functional system for posting updates about the project to selected social media and other communication platforms (LinkedIn, Twitter/X or Mastodon, mailing list) which will also be reflected on the website.&lt;/li>
&lt;li>Optional: improve the UX of the website where needed.&lt;/li>
&lt;li>Optional: implement website analytics (visitor stats, etc.).&lt;/li>
&lt;/ul>
&lt;h3 id="data-visualization-and-analysis-with-polyphypolyglot">Data Visualization and Analysis with PolyPhy/Polyglot&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Data Science&lt;/code> &lt;code>Data Visualization&lt;/code> &lt;code>Point Clustering&lt;/code> &lt;code>3D&lt;/code> &lt;code>Neural Embeddings&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> data science, Python, JavaScript, statistics, familiarity with AI and latent embedding spaces a big plus&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350+ hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kiran-deol/">Kiran Deol&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The aim of this project is to explore a novel data-scientific use case using PolyPhy and its associated web visualization interface &lt;a href="https://github.com/PolyPhyHub/PolyGlot" target="_blank" rel="noopener">PolyGlot&lt;/a>. The contributor is expected to identify a dataset that they are already familiar with and that fits the application scope of the PolyPhy/PolyGlot tooling: a complex point cloud arising from a 3D or higher-dimensional process that will benefit from latent pattern identification and subsequent visual and quantitative analysis. The contributor needs to have the rights to use the dataset, either by owning the copyright or because the data is openly licensed.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Work closely with the mentors to understand the context of the project and its detailed requirements in preparation for the proposal.&lt;/li>
&lt;li>Become acquainted with the tooling (PolyPhy, PolyGlot) prior to the start of the project period.&lt;/li>
&lt;li>Document the nature of the target dataset and define the complete data pipeline with the assistance of the mentors, including the specific analytic tasks and objectives.&lt;/li>
&lt;li>Implement the data pipeline in PolyPhy and PolyGlot.&lt;/li>
&lt;li>Document the process and resulting findings in a publicly available report.&lt;/li>
&lt;/ul></description></item><item><title>noWorkflow as an experiment management tool - Final Report</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230914-jesselima/</link><pubDate>Thu, 14 Sep 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230914-jesselima/</guid><description>&lt;p>This post describes our midterm work status and some achievements we
have made so far in our project
&lt;a href="https://docs.google.com/document/d/1YMtPjZXcgt5eplyxIgQE8IBpQIiRlB9eqVSQiIPhXNU/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>
for
&lt;a href="https://ospo.ucsc.edu/project/osre23/nyu/noworkflow" target="_blank" rel="noopener">noWorkflow&lt;/a>.&lt;/p>
&lt;p>For a friendlier introduction to our work, please refer to this
&lt;a href="https://github.com/jaglima/noworkflow_usecase/blob/main/README.md" target="_blank" rel="noopener">tutorial&lt;/a>.&lt;/p>
&lt;p>Our final code to merge is available in &lt;a href="https://github.com/jaglima/noworkflow/tree/sor_features" target="_blank" rel="noopener">this repository&lt;/a>.&lt;/p>
&lt;h2 id="different-ways-of-managing-experiments">Different ways of managing experiments&lt;/h2>
&lt;p>Since the midterm, we have stayed on track with our initial goal for
the SoR: adding features to noWorkflow for managing DS/ML experimental
setups, with a focus on reproducibility.&lt;/p>
&lt;p>With the emergence of AI across multiple fields in industry and
academia, the subject of reproducibility has become increasingly
relevant. An interesting description of the sources of
irreproducibility in Machine Learning is given in [1]. These sources
are present at different stages of a project's experimental phases and
may even persist in production environments, leading to the
accumulation of technical debt [2]. The problem of irreproducibility is
also discussed in [3, 4], which point out that speed of delivery
usually comes at the expense of reproducibility, among other
casualties.&lt;/p>
&lt;p>The CRISP-DM process as reviewed in [5] demonstrates that Data
Science experiments follow a typical path of execution. In the same
manner, [3, 6, 7] point out that Machine Learning pipelines are
composed of well-defined layers (or stages) throughout their lifecycle.
The emergence of AI in real-world applications has exposed the almost
artisanal ways of creating and managing analytical experiments and
reinforced that there is room to make them more efficient.&lt;/p>
&lt;p>In the search for possible approaches to the problem, we came across
several projects that aim to address these issues. Not surprisingly,
multiple authors have pursued the same goal, for instance [9, 10]. In
these references, and as confirmed in our survey, we found everything
from targeted solutions for specific modeling steps to services aiming
at end-to-end AIOps management. Some are available as software
packages, others as SaaS in cloud environments. In general terms, all
of them end up offering features in different layers of the workflow
(i.e., data, feature, scoring, and evaluation) or with different
conceptualizations of reproducibility/replicability/repeatability, as
noted by [11]. On one hand, this lack of standards makes any
assessment difficult. On the other hand, it suggests a community that
is still exploring a hot topic.&lt;/p>
&lt;p>Specifically for this project, our focus is on the initial stages of
computational scientific experiments. As studied in [8], in this
phase experiments are i) implemented by people as prototypes, ii) with
little focus on pipeline design, and iii) in tools like Notebooks, which
mix documentation, visualization, and code with no required sequential
structure. These three practices impact reproducibility and efficiency
and are prone to creating technical debt. However, tools like noWorkflow
show great potential in such scenarios. They are promising because they i)
demand minimal setup to be functional, ii) work well with almost
nonexistent workflows, iii) require minimal intrusive code
alongside the experimental code, and iv) integrate well with Notebooks, which
are the typical artifact in these experiments.&lt;/p>
&lt;p>According to its core team, the primary goal of noWorkflow is to
&amp;quot;...allow scientists to benefit from provenance data analysis even
when they don't use a workflow system.&amp;quot; Unlike other tools,
&amp;quot;noWorkflow captures provenance from Python scripts without needing a
version control system or any other environment&amp;quot;. It is particularly
interesting when we are in the scenario described above, where we lack
any structured system at the beginning of experiments. In fact, after
going through the docs, we can verify that noWorkflow provides:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Command-line accessibility&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Seamless integration with Jupyter Notebooks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Minimal setup requirements in your environment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Elimination of the need for virtual machines or containers in its
setup&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Workflow-free operation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Open source license&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Framework-agnostic position&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Finally, our research confirmed that there is an open gap in the
management of scientific experiments that reproducibility tooling needs to
fill. Provenance tools can help academic and industry
groups reach this goal, and this summer we focused on adding relevant
features that move noWorkflow in this direction.&lt;/p>
&lt;h2 id="different-tools-for-different-needs">Different tools for different needs&lt;/h2>
&lt;p>In our research phase, we didn't find any taxonomy that fully
accommodated our review of the different categories of tools providing
reproducibility and experiment management. We therefore describe the
tools using the following categories (freely adapted from these online
references
&lt;a href="https://ml-ops.org/content/mlops-principles" target="_blank" rel="noopener">[here]&lt;/a> and
&lt;a href="https://ambiata.com/blog/2020-12-07-mlops-tools/" target="_blank" rel="noopener">[here]&lt;/a>):&lt;/p>
&lt;p>&lt;strong>Data and Pipeline Versioning&lt;/strong>: Platforms dealing with the ingestion,
processing, and exposure of features for model training and inference.
They enable collaboration and discoverability of already existing
Feature Sets across teams and organizations. They also provide provenance
and lineage for data at different levels of complexity.&lt;/p>
&lt;p>&lt;strong>Metadata Stores/Experiment Trackers&lt;/strong>: They are specifically built to
store metadata about ML experiments and expose it to stakeholders. They
help with debugging, comparing, and collaborating on experiments. It is
possible to divide them into Experiment Trackers and a Model Registry.
Moreover, there are projects offering reproducibility features like
hyperparameter search, experiment versioning, etc. However, they demand
more robust workflows and are better suited for projects in the
production/monitoring phases.&lt;/p>
&lt;p>&lt;strong>Pipeline frameworks&lt;/strong>: They operate within the realm of production,
similar to Data Engineering workflows. Their usual goal is to allow any
ML/AI products to be served across a wide range of architectures, and
integrate all the low-hanging fruits along the way. For instance,
pipelines adding hyperparameter optimization tasks, experiment tracking
integrations, boilerplate containerized deployment, etc.&lt;/p>
&lt;p>&lt;strong>Deployment and Observability&lt;/strong>: They focus on deploying models for
real-time inference and monitoring model quality once they are deployed
in production. Their aim is to facilitate post-deployment control tasks
such as monitoring feature drifts, conducting A/B testing, facilitating
fast model shifts, and more.&lt;/p>
&lt;p>The most remarkable aspect of this survey is that there are different
tools for different phases in the life cycle of AI products. Tools
like DVC and Pachyderm act as Metadata Stores, allowing
Experiment Tracking with variable tagging, as well as Data
and Pipeline tracking. They are the tools most similar to noWorkflow in
functionality. However, DVC has a more complex framework for
dealing with different 'types' of tags, and relies on command-line
tools to extract and analyze tagged variables. It also depends strongly
on Git and replicates Git's logic. Pachyderm requires a more
sophisticated setup at the start, relying on containers and a server. This
is an obstacle for small, lean prototypes, since it requires installing a
Docker image and brings all the friction of managing it.&lt;/p>
&lt;p>Other tools, like MLflow and Neptune, position themselves as
Model Experiment Versioning tools with Monitoring and Deployment features.
They also have elements of pipeline frameworks, offering
boilerplate for seamless integration with cloud
platforms.&lt;/p>
&lt;p>Pipelines are a vast field. Examples include AWS SageMaker, Google Vertex,
DataRobot and Weights &amp;amp; Biases, among others. All of them offer features
across all categories, with a strong focus on exploiting every
automation that can be offered to the end user: automatic
parameter tuning, model selection, retraining, data lineage, metadata
storage, etc.&lt;/p>
&lt;p>Finally, Deployment and Observability frameworks are in the deployment
realm, which is another stage far removed from the prototyping phases of
experiments. They come into the scene when all experimental and
inferential processes are done, and there is an AI artifact that needs
to be deployed and monitored. Tools such as Seldon, H2O, and DataRobot do
this job, again with some features for hyperparameter tuning, pipeline
frameworks, and data and pipeline tracking.&lt;/p>
&lt;p>In light of this, when considering the management and operation of
experiments, we have a reduced sample of alternatives. Among them,
Notebook integration and management is rare. Some of them rely on other
tools like Git or impose overhead in coding and setup through reserved
keywords, tags, and managerial workflows that hinder the process.&lt;/p>
&lt;p>At first sight, our &amp;quot;informal&amp;quot; taxonomy positions noWorkflow as a
Data/Pipeline Versioning tool and Metadata Store/Experiment Tracker. It is
not a Pipeline Framework, which works like a building block, facilitating
the integration of artifacts at production stages. Nor is it a
Deployment and Observability framework, since those belong to the
post-deployment realm, which is another stage far removed from the
prototyping phases of experiments.&lt;/p>
&lt;h2 id="desiderata">Desiderata&lt;/h2>
&lt;p>As mentioned earlier, a typical workflow in DS/ML projects is well
described by CRISP-DM [5]
and precedes the deployment and production phases in the whole lifecycle
of DS/ML projects.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image1.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Fig 1: CRISP-DM example of trajectory through a data science project&lt;/p>
&lt;p>Briefly speaking, a workflow starts when a user creates a Jupyter
Notebook and starts writing code. Usually, the user imports or selects
data from a source, explores the features expected to have the
highest inference potential, tunes some parameters to set up
training, then trains the model and evaluates its predictive power through
different metrics. At this final step, we have delineated a trial. The
trial result can suggest further improvements and new hypotheses about
data, features, model types and hyperparameters. Then, we have a new
experiment in mind that will result in a new trial.&lt;/p>
&lt;p>When this process repeats multiple times, a researcher may end up with
several notebooks, each one storing a different experiment. Each
notebook has multiple hyperparameters, modeling choices and modeling
hypotheses. Alternatively, the experimenter may have a single notebook where
different experiments were executed, in a nonlinear order between the
cells. This latter case is pointed out in
[8], where Notebook flexibility
makes it difficult to understand which execution order resulted in a
specific output.&lt;/p>
&lt;p>In a dream space, any researcher or team would benefit most if
they could:&lt;/p>
&lt;p>a) In a running Notebook, retrieve all the operations
that contributed to the result of a variable of interest. In this
case, modifications applied to the inputs or to the order of
operations would be easily detectable, as would any
nonlinear execution that interferes with a control result.&lt;/p>
&lt;p>b) Compare trials after different experiments. After experimenting with
different hypotheses about hyperparameters, features or operation
order, the user should easily compare the history of two trials and
spot differences.&lt;/p>
&lt;p>c) Retrieve a target variable among different trials that were executed
in the context of an experiment. After running multiple
experimental trials, users should be able to compare the results,
whether or not they are stored in different Notebooks.&lt;/p>
&lt;p>d) Be as much &amp;quot;no workflow&amp;quot; as possible. All the requirements above
should be possible with minimal code intervention, tags, reserved
words or any active coding effort.&lt;/p>
&lt;p>With these goals in mind, we worked on our deliverables and used the
experiment carried out by [12]
as a guideline to validate the new noWorkflow features.&lt;/p>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;p>In this section, we describe what we implemented during this
summer.&lt;/p>
&lt;p>We started with tagging cells and variables and then navigating through
their pre-dependencies, that is, all other variables and function calls that
contributed to their final value. This was a fundamental step that allowed
us to build features that are really useful in day-to-day
practice.&lt;/p>
&lt;p>From the features of tagging a cell and tagging a variable, we evolved
to the following features (an interactive notebook is available here):&lt;/p>
&lt;ul>
&lt;li>&lt;em>backwards_deps('var_name', glanularity_level)&lt;/em> : returns a
dictionary storing the operations/function calls and their associated
values that contributed to the final value of the tagged variable.
The glanularity_level argument controls whether the internal operations of
functions are included.&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image5.png" alt="backwards_deps" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/blockquote>
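&lt;p>To make the idea of backward dependencies concrete, here is a minimal sketch that does not use noWorkflow at all. It assumes a hand-written dependency graph (the &lt;em>deps&lt;/em> dictionary and the &lt;em>backward_closure&lt;/em> helper are hypothetical names introduced only for illustration) and walks it backwards from a variable of interest.&lt;/p>

```python
# Hypothetical dependency graph (not produced by noWorkflow):
# each variable maps to the names it was directly computed from.
deps = {
    'score': ['model', 'X_test', 'y_test'],
    'model': ['X_train', 'y_train', 'n_estimators'],
    'X_train': ['df'],
    'X_test': ['df'],
    'y_train': ['df'],
    'y_test': ['df'],
    'df': [],
    'n_estimators': [],
}


def backward_closure(graph, target):
    """Return every name that directly or indirectly contributed to target."""
    seen = []
    stack = [target]
    while stack:
        name = stack.pop()
        for parent in graph.get(name, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen


contributors = backward_closure(deps, 'score')
print(sorted(contributors))
```

&lt;p>A real provenance tool collects this graph automatically while the code runs; the traversal itself is the part that answers "what contributed to this value?".&lt;/p>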
&lt;ul>
&lt;li>
&lt;p>&lt;em>global_backwards_deps&lt;/em>('var_name', glanularity_level) : does the
same as backwards_deps, but across all tagging and
re-tagging events in the notebook. It retrieves the
complete sequence of operations behind a tagged variable across all executed cells in
the notebook.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>store_operations(trial_id, dictionary_ops)&lt;/em> : saves the current
trial so that it can later be compared with other experiments.
The dictionaries aren't stored in &lt;em>.noworkflow/db.sqlite&lt;/em>, but
in a shelve object named &lt;em>ops.db&lt;/em> in the current notebook's local
folder.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>resume_trials()&lt;/em> : to support the management of experiments, the
user can see the trial_ids of all experiments stored in the ops.db
available for comparison/analysis.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>trial_intersection_diff(trial_id1, trial_id2)&lt;/em> : all
variables/function calls shared by two experiments have their scalar
values compared.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image2.png" alt="trial_intersection_diff" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/blockquote>
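&lt;p>The storage and comparison steps above can be sketched with the standard-library &lt;em>shelve&lt;/em> module. This is an illustration under stated assumptions, not noWorkflow's code: the helper names (&lt;em>store_trial&lt;/em>, &lt;em>list_trials&lt;/em>, &lt;em>intersection_diff&lt;/em>), the file name &lt;em>ops_demo&lt;/em>, and the sample values are all hypothetical.&lt;/p>

```python
import os
import shelve
import tempfile


def store_trial(db_path, trial_id, ops):
    """Persist one trial's {name: value} dictionary under its id."""
    with shelve.open(db_path) as db:
        db[trial_id] = ops


def list_trials(db_path):
    """Return the ids of all stored trials."""
    with shelve.open(db_path) as db:
        return sorted(db.keys())


def intersection_diff(db_path, id1, id2):
    """Compare the values of names that appear in both trials."""
    with shelve.open(db_path) as db:
        a, b = db[id1], db[id2]
    shared = sorted(set(a).intersection(b))
    return {name: (a[name], b[name]) for name in shared if a[name] != b[name]}


# Hypothetical trials with made-up scalar values.
db_path = os.path.join(tempfile.mkdtemp(), 'ops_demo')
store_trial(db_path, 'trial_1', {'n_estimators': 100, 'test_size': 0.2, 'f1': 0.81})
store_trial(db_path, 'trial_2', {'n_estimators': 300, 'test_size': 0.2, 'f1': 0.84, 'pca': 3})

trials = list_trials(db_path)
changed = intersection_diff(db_path, 'trial_1', 'trial_2')
print(trials)
print(changed)
```

&lt;p>Names that exist in only one trial (here &lt;em>pca&lt;/em>) are ignored by design, which mirrors the "intersection" in the function name.&lt;/p>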
&lt;ul>
&lt;li>&lt;em>trial_diff(trial_id1, trial_id2)&lt;/em> : The values of variables and
function calls are exhibited in a diff file format, emphasizing the
operations' order. The goal here is to show that between the two
experiments, the order of operations was different. Again, only
scalar values are exhibited. More complex data structures (matrices,
vectors, tensors, etc.) are only signaled as &lt;em>'complex_type'&lt;/em>&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image3.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/blockquote>
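&lt;p>An order-sensitive comparison of this kind can be approximated with the standard-library &lt;em>difflib&lt;/em> module. The sketch below is an assumption-laden illustration rather than the actual implementation: the trial contents and the &lt;em>ops_to_lines&lt;/em> helper are hypothetical, and non-scalar values are collapsed to the marker &lt;em>'complex_type'&lt;/em> as described above.&lt;/p>

```python
import difflib


def ops_to_lines(ops):
    """Render an ordered list of (name, value) pairs as text lines.

    Scalars are shown as-is; anything else is replaced by a marker.
    """
    lines = []
    for name, value in ops:
        if isinstance(value, (int, float, str, bool)):
            shown = repr(value)
        else:
            shown = 'complex_type'
        lines.append(f'{name} = {shown}')
    return lines


# Hypothetical trials: same steps, but scaling and PCA happen in a different order.
trial_a = [('df', [[1, 2], [3, 4]]), ('scaled', [[0.0, 1.0]]),
           ('pca_components', 3), ('f1', 0.81)]
trial_b = [('df', [[1, 2], [3, 4]]), ('pca_components', 3),
           ('scaled', [[0.0, 1.0]]), ('f1', 0.84)]

diff = list(difflib.unified_diff(
    ops_to_lines(trial_a), ops_to_lines(trial_b),
    fromfile='trial_a', tofile='trial_b', lineterm=''))
print('\n'.join(diff))
```

&lt;p>Because the input is an ordered sequence, a reordering of operations shows up as removed and added lines even when every individual value is unchanged.&lt;/p>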
&lt;ul>
&lt;li>&lt;em>var_tag_plot('var_name')&lt;/em> : Charts the evolution of a given
variable across multiple trials in the database. In this case, all
experiments stored in ops.db and tagged as &lt;em>target_var&lt;/em> have their
values plotted.&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image4.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>&lt;em>var_tag_values('var_name') :&lt;/em> Returns a pandas DataFrame with the
var_name entries and their corresponding values across different trials.&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image6.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>As expected, we had unexpected findings along the project. Below, we
delve into the most significant challenges we had to face:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Jupyter notebooks allow nonlinear execution of small pieces of code
through cells. More than once, we had to agree on how to design
functionality for scenarios we had not anticipated.
One example was the backwards_deps() and global_backwards_deps()
functions. The latter function was born to cover the case where the
user wants all dependencies rather than only the local cell dependencies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Despite the high quality of the current version of the package, the
project needs documentation, which slows down the analysis of any
new development. In this project, the aid of mentors was crucial at
some points where a deeper knowledge was needed.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>What is the vocation of noWorkflow? At some points in the project,
we had to discuss forcing some kind of workflow on the user, which
would go against the philosophy of the project.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When working on comparing results, especially in DS/ML fields,
complex types arise. Numerical vectors, matrices, and tensors from
NumPy and other frameworks, as well as data frames, can't be
properly manipulated based on our current approach.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The dilemma of focusing on graphic visual features versus more
sophisticated APIs. More than once, we needed to choose between
making a visual add-on to Jupyter or implementing a more complete
API.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The current version of Jupyter support in noWorkflow doesn&amp;rsquo;t
integrate well with Jupyter Lab. In addition, IPython itself keeps
releasing new versions, and noWorkflow needs to adapt to them.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="future-improvements">Future Improvements&lt;/h2>
&lt;p>Given our current achievements and the insights gained along the
project, we would highlight the following points as crucial future
roadmap improvements:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Add handling of complex types in comparisons. Today, visualizing and
navigating through matrices, data frames, and tensors isn't possible
with noWorkflow, although users can do so by their own means.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Move the dictionaries storing sequences of operations from
shelve objects to a more efficient storage and retrieval mechanism.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Make it easier for users to manage (store, retrieve, and navigate)
through different trials.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Add graphical management instead of relying upon API calls only.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Evolve the feature of tagging cells.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When tagging a model, save its binary representation to be recovered
in the future.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Add the capability of tracking reads of local datasets.
Currently, it is possible to track changes in the name/path of the
dataset. However, any modification to the contents of a dataset is
not traceable.&lt;/p>
&lt;/li>
&lt;/ul>
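&lt;p>As an illustration of the last item, one common way to detect changes in a dataset's contents (rather than its path) is to record a content hash each time the file is read. The sketch below uses only the standard library; the &lt;em>file_fingerprint&lt;/em> helper and the sample file are hypothetical and are not part of noWorkflow.&lt;/p>

```python
import hashlib
import os
import tempfile


def file_fingerprint(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical dataset written to a temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'data.csv')
with open(path, 'w') as handle:
    handle.write('id,amount\n1,10.5\n')
before = file_fingerprint(path)

# Same path, different contents: the fingerprint changes.
with open(path, 'a') as handle:
    handle.write('2,99.0\n')
after = file_fingerprint(path)

print(before != after)
```

&lt;p>Storing such a fingerprint next to the dataset path in each trial would let two trials be flagged as reading different data even when the file name is identical.&lt;/p>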
&lt;h2 id="what-ive-learned">What I've learned&lt;/h2>
&lt;p>This was a great summer with two personal discoveries. The first one was
my first formal contact with the subject of reproducibility. The second was
contributing fully to an Open Source project. In the research phase,
I got in touch with the state of the art in reproducibility
research and some of its nuances. In the Open Source contribution
experience, I was mentored by the core team of noWorkflow and
exercised all the skills required to build a high-quality software product.&lt;/p>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>I would like to thank the organizers of the Summer of Reproducibility for
providing this wonderful opportunity for interested people to engage with
Open Source software. Also, thanks to the core team of noWorkflow for
supporting me in doing this work.&lt;/p>
&lt;h2 id="bibliography">Bibliography&lt;/h2>
&lt;p>[1] [O. E. Gundersen, K. Coakley, C. Kirkpatrick, and Y. Gil, &amp;ldquo;Sources
of irreproducibility in machine learning: A review,&amp;rdquo; &lt;em>arXiv preprint
arXiv:2204.07610&lt;/em>.]&lt;/p>
&lt;p>[2] [D. Sculley &lt;em>et al.&lt;/em>, &amp;ldquo;Machine Learning: The High Interest Credit
Card of Technical Debt,&amp;rdquo; in &lt;em>SE4ML: Software Engineering for Machine
Learning (NIPS 2014 Workshop)&lt;/em>,
2014.]&lt;/p>
&lt;p>[3] [P. Sugimura and F. Hartl, &amp;ldquo;Building a reproducible machine
learning pipeline,&amp;rdquo; &lt;em>arXiv preprint arXiv:1810.04570&lt;/em>,
2018.]&lt;/p>
&lt;p>[4] [D. Sculley &lt;em>et al.&lt;/em>, &amp;ldquo;Hidden technical debt in machine learning
systems,&amp;rdquo; &lt;em>Adv. Neural Inf. Process. Syst.&lt;/em>, vol. 28,
2015.]&lt;/p>
&lt;p>[5] [F. Martínez-Plumed &lt;em>et al.&lt;/em>, &amp;ldquo;CRISP-DM twenty years later: From
data mining processes to data science trajectories,&amp;rdquo; &lt;em>IEEE Trans. Knowl.
Data Eng.&lt;/em>, vol. 33, no. 8, pp. 3048&amp;ndash;3061,
2019.]&lt;/p>
&lt;p>[6] [N. A. Lynnerup, L. Nolling, R. Hasle, and J. Hallam, &amp;ldquo;A Survey on
Reproducibility by Evaluating Deep Reinforcement Learning Algorithms on
Real-World Robots,&amp;rdquo; in &lt;em>Proceedings of the Conference on Robot
Learning&lt;/em>, L. P. Kaelbling, D. Kragic, and K. Sugiura, Eds., in
Proceedings of Machine Learning Research, vol. 100. PMLR, 30 Oct&amp;ndash;01
Nov 2020, pp. 466&amp;ndash;489.]&lt;/p>
&lt;p>[7] [A. Masood and A. Hashmi, &amp;ldquo;AIOps:
predictive analytics &amp;amp; machine learning in operations,&amp;rdquo; &lt;em>Cognitive
Computing Recipes: Artificial Intelligence Solutions Using Microsoft
Cognitive Services and TensorFlow&lt;/em>, pp. 359&amp;ndash;382,
2019.]&lt;/p>
&lt;p>[8] [J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire,
&amp;ldquo;Understanding and improving the quality and reproducibility of Jupyter
notebooks,&amp;rdquo; &lt;em>Empirical Software Engineering&lt;/em>, vol. 26, no. 4, p. 65,
2021.]&lt;/p>
&lt;p>[9] [D. Kreuzberger, N. Kühl, and S. Hirschl, &amp;ldquo;Machine Learning
Operations (MLOps): Overview, Definition, and Architecture,&amp;rdquo; &lt;em>IEEE
Access&lt;/em>, vol. 11, pp. 31866&amp;ndash;31879,
2023.]&lt;/p>
&lt;p>[10] [N. Hewage and D. Meedeniya, &amp;ldquo;Machine learning operations: A
survey on MLOps tool support,&amp;rdquo; &lt;em>arXiv preprint arXiv:2202.10169&lt;/em>,
2022.]&lt;/p>
&lt;p>[11] [H. E. Plesser, &amp;ldquo;Reproducibility vs. replicability: a brief
history of a confused terminology,&amp;rdquo; &lt;em>Front. Neuroinform.&lt;/em>, vol. 11, p.
76, 2018.]&lt;/p>
&lt;p>[12] [Z. Salekshahrezaee, J. L. Leevy, and T. M. Khoshgoftaar, &amp;ldquo;The
effect of feature extraction and data sampling on credit card fraud
detection,&amp;rdquo; &lt;em>Journal of Big Data&lt;/em>, vol. 10, no. 1, pp. 1&amp;ndash;17,
2023.]&lt;/p></description></item><item><title>[Mid-term] Capturing provenance into Data Science/Machine Learning workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230731-jesselima/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230731-jesselima/</guid><description>&lt;p>This post describes our midterm work status and some achievements we have done so far in &lt;a href="https://docs.google.com/document/d/1YMtPjZXcgt5eplyxIgQE8IBpQIiRlB9eqVSQiIPhXNU/edit#heading=h.nnxl1g16trg0" target="_blank" rel="noopener">the project&lt;/a> for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow/">noWorkflow&lt;/a> package.&lt;/p>
&lt;h4 id="the-initial-weeks">The initial weeks&lt;/h4>
&lt;p>I started doing a bibliographical review on reproducibility in the Data Science (DS) and Machine Learning (ML) realms. It was a new subject to me, and I aimed to build a more robust theoretical background in the field. Meanwhile, I took notes in &lt;a href="https://jaglima.github.io/" target="_blank" rel="noopener">this series of posts&lt;/a>.&lt;/p>
&lt;p>Then, as planned, I integrated with the current noWorkflow supporters in order to get a broader view of the project and their contributions. Additionally, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, and I set up a weekly one-hour schedule to keep track of my activities.&lt;/p>
&lt;h3 id="brainstormed-opportunities">Brainstormed opportunities&lt;/h3>
&lt;p>At the beginning of June, we also met with other project supporters to brainstorm about our initial proposal. From this meeting, we came up with a plan for how to technically approach a new noWorkflow feature for Data Science and Machine Learning experiment management.&lt;/p>
&lt;p>In this brainstorm, we agreed that &lt;em>Jupyter Notebooks are, by far, the most frequent setup in DS/ML computational experiments. They established themselves as the fundamental artifact by embedding code and text and enabling execution and visualization. Entire experiments are created and kept in Jupyter notebooks until they are sent to production. The opportunity at hand is to integrate noWorkflow with Jupyter Notebooks&lt;/em>.
Then, our mid-term goal was adapted from the original plan of only selecting and executing a prototypical ML experiment. We added the goal of paving the way for providing a tagging feature for Notebook cells.&lt;/p>
&lt;p>More specifically, DS/ML experimental workflows usually have well-defined stages composed of &lt;em>data reading&lt;/em>, &lt;em>feature engineering&lt;/em>, &lt;em>model scoring&lt;/em>, and &lt;em>metrics evaluation&lt;/em>. In our dream space, the user would tag a cell in their experiment, enabling the capture of the tagged metadata into a database. This step integrates the ultimate goal of facilitating comparisons, management, and even causal inference across different trials of a DS/ML experiment.&lt;/p>
&lt;h3 id="current-deliverables">Current deliverables&lt;/h3>
&lt;p>So, based on our plans, we created a separate table to store the metadata from cell tagging. This table stores the cell hash codes and the information needed to match the code executed within a cell. As a result, we can store tags and the activation ids of the cells, enabling us to identify a cell containing a given stage in a DS/ML experiment.&lt;/p>
&lt;p>The second feature implemented was tagging a specific variable. As with cells, it is now possible to stamp a given variable with a tag, keeping its name, id, and received value in this separate table.&lt;/p>
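&lt;p>To illustrate what such a tagging table might look like, here is a small sketch using the standard-library &lt;em>sqlite3&lt;/em> module. The table name, column names, and the &lt;em>tag_cell&lt;/em> helper are assumptions made for this example and do not reflect noWorkflow's actual schema.&lt;/p>

```python
import hashlib
import sqlite3

# In-memory database with a hypothetical schema for cell tags.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE cell_tag ('
    ' id INTEGER PRIMARY KEY,'
    ' trial_id TEXT NOT NULL,'
    ' activation_id INTEGER NOT NULL,'
    ' code_hash TEXT NOT NULL,'
    ' tag TEXT NOT NULL)'
)


def tag_cell(trial_id, activation_id, code, tag):
    """Store a tag together with a hash of the cell's source code."""
    code_hash = hashlib.sha1(code.encode('utf-8')).hexdigest()
    conn.execute(
        'INSERT INTO cell_tag (trial_id, activation_id, code_hash, tag)'
        ' VALUES (?, ?, ?, ?)',
        (trial_id, activation_id, code_hash, tag),
    )


# Made-up cells tagged with two of the workflow stages.
tag_cell('trial_1', 12, "df = pd.read_csv('data.csv')", 'data_reading')
tag_cell('trial_1', 27, 'model.fit(X_train, y_train)', 'model_scoring')

rows = conn.execute(
    'SELECT activation_id FROM cell_tag WHERE tag = ?', ('model_scoring',)
).fetchall()
print(rows)
```

&lt;p>Keeping tags in their own table means the stage of an experiment can be looked up by tag without touching the rest of the provenance data.&lt;/p>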
&lt;p>Finally, we worked on displaying the dependencies of a given variable. In this case, by tagging a given variable, we can display the other variables, values, and cells activated in its construction. Then, we can visualize the dependencies that contributed to its final value.&lt;/p>
&lt;p>For an overview of current developments, please refer to my &lt;a href="https://github.com/jaglima/noworkflow/tree/stage_tagging" target="_blank" rel="noopener">fork of the main project&lt;/a>.&lt;/p>
&lt;h3 id="challenges">Challenges&lt;/h3>
&lt;p>During this period, we had to make choices along the way. For instance, capturing the provenance of cells through tags is a different solution than tagging code chunks in scripts. In this case, we decided to stick with tagging Notebook cells at this moment. We also opted to start storing the metadata to enable comparisons between trials rather than focus on a sophisticated graphic and user-friendly cell tagging system. We also opted to keep this metadata info stored in a separate table in the database.&lt;/p>
&lt;h3 id="next-steps">Next steps&lt;/h3>
&lt;p>In the second half of the summer, our goal is to integrate these features in order to proceed with comparisons among experiments. Such comparisons would use the tagged variables as the hyperparameters of DS/ML experiments or key variables to assess the experiments, such as errors or scores. As a result, we will be able to compare the results of two trials more accurately and reproduce experiments more easily.&lt;/p></description></item><item><title>Verify the reproducibility of an experiment</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230524-jesselima/</link><pubDate>Wed, 24 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230524-jesselima/</guid><description>&lt;p>Hello everyone,
my name is Jesse and I&amp;rsquo;m proud to be a fellow in this 2023 Summer of Reproducibility program, contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow">noWorkflow&lt;/a> project.&lt;/p>
&lt;p>My &lt;a href="https://docs.google.com/document/d/1YMtPjZXcgt5eplyxIgQE8IBpQIiRlB9eqVSQiIPhXNU/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> was accepted under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a> and aims to
work on mapping and testing the capture of provenance in typical Data Science and Machine Learning experiments.&lt;/p>
&lt;h4 id="what">What&amp;hellip;&lt;/h4>
&lt;p>Although much can be said about what reproducibility means, the ability to replicate results in day-to-day Data Science and Machine Learning experiments can pose a significant challenge for individuals, companies and research centers. This challenge becomes even more pronounced with the emergence of analytics and AI, where scientific methodologies are extensively applied on an industrial scale. Reproducibility then assumes a key role in the productivity and accountability expected from Data Scientists, Machine Learning Engineers, and other roles engaged in ML/AI projects.&lt;/p>
&lt;h4 id="how">How&amp;hellip;&lt;/h4>
&lt;p>In the day-to-day, the pitfalls of non-reproducibility appear at different points of the experiment lifecycle. These challenges arise when multiple experiments need to be managed for an individual or a team of scientists. In a typical experiment workflow, reproducibility appears in different steps of the process:&lt;/p>
&lt;ul>
&lt;li>The need to track the provenance of datasets.&lt;/li>
&lt;li>The need to manage changes in hypothesis tests.&lt;/li>
&lt;li>Addressing the management of system hardware and OS setups.&lt;/li>
&lt;li>Dealing with outputs from multiple experiments, including the results of various model trials.&lt;/li>
&lt;/ul>
&lt;p>In academic environments, these issues can result in mistakes and inaccuracies. In companies, they can lead to inefficiencies and technical debts that are difficult to address in the future.&lt;/p>
&lt;h4 id="finally">Finally&amp;hellip;&lt;/h4>
&lt;p>I believe this is a great opportunity to explore the emergence of these two hot topics, AI and reproducibility! I will share more updates here throughout this summer and hope we can learn a lot together!&lt;/p></description></item><item><title>FlashNet: Towards Reproducible Data Science for Storage System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/</guid><description>&lt;p>The Data Storage Research Vision 2025, organized in an NSF workshop, calls for more “AI for storage” research. However, performing ML-for-storage research can be a daunting task for new storage researchers. The person must know both the storage side as well as the ML side, as if studying two different fields at the same time. This project aims to answer these questions:&lt;/p>
&lt;ol>
&lt;li>How can we encourage data scientists to look into storage problems?&lt;/li>
&lt;li>How can we create a transparent platform that allows such decoupling?&lt;/li>
&lt;li>Within the storage/ML community can we create two collaborative communities, the storage engineers and the storage data scientists?&lt;/li>
&lt;/ol>
&lt;p>In the ML/Deep Learning community, the large ImageNet benchmarks have spurred research in image recognition. Similarly, we would like to provide benchmarks for fostering storage research in ML-based per-IO latency prediction. Therefore, we present FlashNet, a reproducible data science platform for storage systems. As a starting point, we use I/O latency prediction as a case study. Thus, FlashNet has been built for I/O latency prediction tasks. With FlashNet, data engineers can collect the IO traces of various devices. The data scientists can then train the ML models to predict the IO latency based on those traces. All traces, results, and code will be shared in the FlashNet training ground platform, which uses Chameleon Trovi for better reproducibility.&lt;/p>
&lt;p>In this project, we plan to improve the modularity of the FlashNet pipeline and develop the Chameleon Trovi packages. We will also continue to improve the performance of our binary-class and multiclass classifiers and test them on the new production traces that we collected from the SNIA IOTA public trace repository. Finally, we will optimize the deployment of our continual-learning mechanism and test it in a cloud system environment. To the best of our knowledge, we are building the world's first end-to-end data science platform for storage systems.&lt;/p>
&lt;h3 id="building-flashnet-platform">Building FlashNet Platform&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, reproducibility, machine learning, continual learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python, PyTorch, experience with machine learning pipelines&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/justin-shin/">Justin Shin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/maharani-ayu-putri-irawan/">Maharani Ayu Putri Irawan&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build an open-source platform to enable collaboration between the storage and ML communities, specifically to provide a common platform for advancing data science research for storage systems. The platform will be able to reproduce and evaluate different ML models and architectures, dataset patterns, data preprocessing techniques, and feature engineering strategies.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Reproduce the FlashNet evaluation results from prior works.&lt;/li>
&lt;li>Build and improve FlashNet components based on the existing blueprint.&lt;/li>
&lt;li>Collect and analyze the FlashNet evaluation results.&lt;/li>
&lt;/ul></description></item><item><title>Polyphorm / PolyPhy</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy/</link><pubDate>Thu, 15 Dec 2022 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy/</guid><description>&lt;p>&lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">PolyPhy&lt;/a> is a GPU-oriented agent-based system for reconstructing and visualizing &lt;em>optimal transport networks&lt;/em> defined over sparse data. Rooted in astronomy and inspired by nature, we have used an early prototype called &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">Polyphorm&lt;/a> to reconstruct the &lt;a href="https://youtu.be/5ILwq5OFuwY" target="_blank" rel="noopener">Cosmic web&lt;/a> structure, but also to discover network-like patterns in natural language data. You can see an instructive overview of PolyPhy in our &lt;a href="https://elek.pub/workshop_cross2022.html" target="_blank" rel="noopener">workshop&lt;/a> and more details about our research &lt;a href="https://elek.pub/projects/Rhizome-Cosmology" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Under the hood, PolyPhy uses a richer 3D scalar field representation of the reconstructed network, instead of a typical discrete representation like a graph or a mesh. The ultimate purpose of PolyPhy is to become a toolkit for a range of specialists across different disciplines: astronomers, neuroscientists, data scientists and even artists and designers. PolyPhy aspires to be a tool for discovering connections between different disciplines by creating quantitatively comparable structural analytics.&lt;/p>
&lt;h3 id="polyphy-infrastructure-engineering-and-practices">PolyPhy infrastructure engineering and practices&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>DevOps&lt;/code> &lt;code>Code Refactoring&lt;/code> &lt;code>CI/CD&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> fluency in Python, experience with OOP, experience with building and packaging libraries, understanding of GitHub and its tools ecosystem&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350+ hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:anishagoel14@gmail.com">Anisha Goel&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/prashant-jha/">Prashant Jha&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Your responsibility in this project will be developing new infrastructure for the PolyPhy project as well as maintaining the existing &lt;a href="https://github.com/PolyPhyHub/" target="_blank" rel="noopener">codebases&lt;/a>. This is a multifaceted role that will require coordination with the team and an active approach to understanding the technical needs of the community.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Work with the technical lead to develop effective interfaces for PolyPhy, providing access to its functionality on the level of both Python/Jupyter code and the command line.&lt;/li>
&lt;li>Maintain the existing &lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">codebase&lt;/a> and configure it according to the team&amp;rsquo;s needs.&lt;/li>
&lt;li>Develop and extend the current CI/CD functionality and related code metrics.&lt;/li>
&lt;li>Document the best practices related to the above.&lt;/li>
&lt;/ul>
&lt;h3 id="write-polyphys-technical-story-and-content">Write PolyPhy&amp;rsquo;s technical story and content&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Writing&lt;/code> &lt;code>Documentation&lt;/code> &lt;code>Storytelling&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> experience writing structured text, well read, technical or scientific education, webdev basics (preferably NodeJS)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:ez@nmsu.edu">Ezra Huscher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Integral to PolyPhy&amp;rsquo;s presentation is a &amp;ldquo;story&amp;rdquo; - a narrative understanding - that the users and the project contributors can relate to. Your responsibility will be to develop the written part of that understanding, as well as major portions of technical documentation that match it.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Write and edit diverse pages of the project &lt;a href="https://www.polyphy.io" target="_blank" rel="noopener">website&lt;/a>.&lt;/li>
&lt;li>Work with mentors to improve the project&amp;rsquo;s written community practices (diversity, communication).&lt;/li>
&lt;li>Write and edit narrative and explanatory parts of PolyPhy&amp;rsquo;s documentation.&lt;/li>
&lt;li>Create tutorials that present core functionality of the toolkit.&lt;/li>
&lt;/ul>
&lt;h3 id="community-engagement-and-management">Community engagement and management&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Community Management&lt;/code> &lt;code>Social Media&lt;/code> &lt;code>Networking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> documented experience with the current social media landscape, social and well spoken, ability to communicate technical concepts&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 175 or 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:ez@nmsu.edu">Ezra Huscher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Your responsibility will be to build and engage the community around PolyPhy. This includes its standing team and stakeholders, current expert users, potential adopters as well as the general public. The scope (size) of the project depends on the level of commitment during and beyond the summer and is negotiable upfront.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Manage the team&amp;rsquo;s communication channels (Slack, Zoom, email) and maintain active presence therein.&lt;/li>
&lt;li>Develop social media presence for PolyPhy on Twitter, LinkedIn and other selected social media platforms.&lt;/li>
&lt;li>Manage and extend the online presence for the project, including its &lt;a href="https://polyphy.io" target="_blank" rel="noopener">website&lt;/a>, mailing list, and other applicable outreach activities.&lt;/li>
&lt;li>Research and engage with new communities that would benefit from PolyPhy, both as its expert users and contributors.&lt;/li>
&lt;/ul></description></item><item><title>Apache AsterixDB</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucr/asterixdb/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucr/asterixdb/</guid><description>&lt;p>&lt;a href="http://asterixdb.apache.org/" target="_blank" rel="noopener">AsterixDB&lt;/a> is an open source parallel big-data management system. AsterixDB is a well-established Apache project that has been active in research for more than 10 years. It provides a flexible data model that supports modern NoSQL applications with a powerful query processor that can scale to billions of records and terabytes of data. Users can interact with AsterixDB through a powerful and easy-to-use declarative query language, SQL++, which provides a rich set of data types including timestamps, time intervals, text, and geospatial, in addition to traditional numerical and Boolean data types.&lt;/p>
&lt;h3 id="geospatial-data-science-on-asterixdb">Geospatial Data Science on AsterixDB&lt;/h3>
&lt;ul>
&lt;li>&lt;em>Topics&lt;/em>: Data science, SQL++, documentation&lt;/li>
&lt;li>&lt;em>Skills&lt;/em>: SQL, Writing, Spreadsheets&lt;/li>
&lt;li>&lt;em>Difficulty&lt;/em>: Medium&lt;/li>
&lt;li>&lt;em>Size&lt;/em>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;em>Mentors&lt;/em>: &lt;a href="mailto:eldawy@ucr.edu">Ahmed Eldawy&lt;/a>, &lt;a href="mailto:asevi006@ucr.edu">Akil Sevim&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build a data science project using AsterixDB that analyzes geospatial data among other dimensions. Use &lt;a href="https://star.cs.ucr.edu/?Chicago%20Crimes#center=41.8313,-87.6830&amp;amp;zoom=11" target="_blank" rel="noopener">Chicago Crimes&lt;/a> as the main dataset and combine it with other datasets, including &lt;a href="https://star.cs.ucr.edu/?osm21/pois#center=41.8313,-87.6830&amp;amp;zoom=11" target="_blank" rel="noopener">points of interest&lt;/a> and &lt;a href="https://star.cs.ucr.edu/?TIGER2018/ZCTA5#center=41.8313,-87.6830&amp;amp;zoom=11" target="_blank" rel="noopener">ZIP Code boundaries&lt;/a>. During this project, we will answer interesting questions about the data and visualize the results, such as:&lt;/p>
&lt;ul>
&lt;li>What is the most common crime type on a specific date or over the weekends?&lt;/li>
&lt;li>Where do most of the arrests happen?&lt;/li>
&lt;li>How do crime rates change over time for different regions?&lt;/li>
&lt;/ul>
&lt;h4 id="the-goals-of-this-project-are">The goals of this project are:&lt;/h4>
&lt;ul>
&lt;li>Understand how to build a scalable data science project using AsterixDB.&lt;/li>
&lt;li>Translate common questions to SQL queries and run them on large data.&lt;/li>
&lt;li>Learn how to visualize the results of queries and present them.&lt;/li>
&lt;li>Write detailed documentation about the process of building a data science application in AsterixDB.&lt;/li>
&lt;li>Improve the documentation of AsterixDB while working on the project to improve the experience for future users.&lt;/li>
&lt;/ul>
&lt;h4 id="machine-learning-integration">Machine Learning Integration&lt;/h4>
&lt;p>As a bonus task, and depending on the progress of the project, we can explore the integration of machine learning with AsterixDB through Python UDFs. We will utilize the AsterixDB Python integration through &lt;a href="https://asterixdb.apache.org/docs/0.9.7/udf.html" target="_blank" rel="noopener">user-defined functions&lt;/a> to connect the AsterixDB backend with &lt;a href="https://scikit-learn.org/stable/index.html" target="_blank" rel="noopener">scikit-learn&lt;/a> to build some unsupervised and supervised models for the data. For example, we can cluster the crimes based on their location and other attributes to find interesting patterns or hotspots.&lt;/p></description></item><item><title>FasTensor</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/lbl/fastensor/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/lbl/fastensor/</guid><description>&lt;p>&lt;a href="https://sdm.lbl.gov/fastensor/" target="_blank" rel="noopener">FasTensor&lt;/a> is a parallel execution engine for user-defined functions on multidimensional arrays. The user-defined functions follow the stencil metaphor used for scientific computing and are effective for expressing a wide range of computations for data analyses, including common aggregation operations from database management systems and advanced machine learning pipelines. The FasTensor execution engine exploits the structural locality in the multidimensional arrays to automate data management operations such as file I/O, data partitioning, communication, parallel execution, and so on.&lt;/p>
&lt;h3 id="continuous-integration">Continuous Integration&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Data Management&lt;/code>, &lt;code>Analytics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, GitHub&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:kwu@lbl.gov">John Wu&lt;/a>, &lt;a href="mailto:dbin@lbl.gov">Bin Dong&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>Develop a test suite for the public API of FasTensor&lt;/li>
&lt;li>Automate execution of the test suite&lt;/li>
&lt;li>Document the continuous integration process&lt;/li>
&lt;li>Develop performance testing suite&lt;/li>
&lt;/ul></description></item><item><title>FasTensor</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/fastensor/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/fastensor/</guid><description>&lt;p>&lt;a href="https://sdm.lbl.gov/fastensor/" target="_blank" rel="noopener">FasTensor&lt;/a> is a parallel execution engine for user-defined functions on multidimensional arrays. The user-defined functions follow the stencil metaphor used for scientific computing and are effective for expressing a wide range of computations for data analyses, including common aggregation operations from database management systems and advanced machine learning pipelines. The FasTensor execution engine exploits the structural locality in the multidimensional arrays to automate data management operations such as file I/O, data partitioning, communication, parallel execution, and so on.&lt;/p>
&lt;h3 id="tensor-execution-engine-on-gpu">Tensor execution engine on GPU&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Data Management&lt;/code>, &lt;code>Analytics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, GitHub&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Tensor-based computing is needed by scientific applications and, increasingly, by advanced AI model training. Most tensor libraries are hand-customized and optimized for GPUs, and most of them serve only one kind of application. For example, TensorFlow is only optimized for AI model training. Optimizing generic tensor computing libraries on GPUs can benefit a wide range of applications. FasTensor, as a generic tensor computing library, currently works efficiently only on CPUs. How to run FasTensor on GPUs is still unexplored. Research and development challenges include but are not limited to: 1) how to maintain the structural locality of tensor data on GPUs; 2) how to reduce the performance loss when the structural locality of a tensor is broken on GPUs.&lt;/p>
&lt;ul>
&lt;li>Develop a mechanism to move user-defined computing kernels onto GPU&lt;/li>
&lt;li>Evaluate the performance of the execution engine&lt;/li>
&lt;li>Document the execution mechanism&lt;/li>
&lt;li>Develop performance testing suite&lt;/li>
&lt;/ul>
&lt;h3 id="continuous-integration">Continuous Integration&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Data Management&lt;/code>, &lt;code>Analytics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, GitHub&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (300 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a>, &lt;a href="mailto:dbin@lbl.gov">Bin Dong&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>Develop a test suite for the public API of FasTensor&lt;/li>
&lt;li>Automate execution of the test suite&lt;/li>
&lt;li>Document the continuous integration process&lt;/li>
&lt;/ul></description></item><item><title>Polyphorm / PolyPhy</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/polyphorm/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/polyphorm/</guid><description>&lt;p>&lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">Polyphorm&lt;/a> is an agent-based system for reconstructing and visualizing &lt;em>optimal transport networks&lt;/em> defined over sparse data. Rooted in astronomy and inspired by nature, we have used Polyphorm to reconstruct the &lt;a href="https://youtu.be/5ILwq5OFuwY" target="_blank" rel="noopener">Cosmic web&lt;/a> structure, but also to discover network-like patterns in natural language data. You can find more details about our research &lt;a href="https://elek.pub/projects/Rhizome-Cosmology" target="_blank" rel="noopener">here&lt;/a>. Under the hood, Polyphorm uses a richer 3D scalar field representation of the reconstructed network, instead of a discrete representation like a graph or a mesh.&lt;/p>
&lt;p>&lt;strong>PolyPhy&lt;/strong> will be a Python-based redesigned version of Polyphorm, currently at the beginning of its development cycle. PolyPhy will be a multi-platform toolkit meant for a wide audience across different disciplines: astronomers, neuroscientists, data scientists and even artists and designers. All of the offered projects focus on PolyPhy, with a variety of topics including design, coding, and even research. Ultimately, PolyPhy will become a tool for discovering connections between different disciplines by creating quantitatively comparable structural analytics.&lt;/p>
&lt;h3 id="develop-website-for-polyphy">Develop website for PolyPhy&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>Dynamic Updates&lt;/code> &lt;code>UX&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> web development experience, good communicator, (HTML/CSS), (Javascript)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop a clean and welcoming website for the project. The organization needs to reflect the needs of PolyPhy users, but also provide a convenient entry point for interested project contributors. No excessive pop-ups or webjunk.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Port the contents of the &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">repository page&lt;/a> to a dedicated website.&lt;/li>
&lt;li>Design the structure of the website according to open-source best practices.&lt;/li>
&lt;li>Work with the visual designer (see below) in creating a coherent and organic presentation.&lt;/li>
&lt;li>Interactively link important metrics from the project dev environment as well as documentation.&lt;/li>
&lt;/ul>
&lt;h3 id="design-visual-experience-for-polyphys-website-and-presentations">Design visual experience for PolyPhy&amp;rsquo;s website and presentations&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Design&lt;/code> &lt;code>Art&lt;/code> &lt;code>UX&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> vector and bitmap drawing, sense for spatial symmetry and framing, (interactive content creation), (animation)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop visual content for the project using its main themes: nature-inspired computation, biomimetics, interconnected structures. Aid in designing visual structure of the website as well as other public-facing artifacts.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Design imagery and other graphical elements to visually (re-)present PolyPhy.&lt;/li>
&lt;li>Work with the technical writer (see below) in designing a coherent story.&lt;/li>
&lt;li>Work with the web developer (see above) in creating a coherent and organic presentation.&lt;/li>
&lt;/ul>
&lt;h3 id="write-polyphys-technical-story-and-content">Write PolyPhy&amp;rsquo;s technical story and content&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Writing&lt;/code> &lt;code>Documentation&lt;/code> &lt;code>Storytelling&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> experienced writing structured text over 10 pages, well read, (technical or scientific education)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Integral to PolyPhy&amp;rsquo;s presentation is a story that the users and the project contributors can relate to. The objective is to develop the verbal part of that story, as well as major portions of technical documentation that match it. The difficulty of the project is scalable.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Write different pages of the project website.&lt;/li>
&lt;li>Work with mentors to improve the project&amp;rsquo;s written community practices (diversity, communication).&lt;/li>
&lt;li>Write and edit narrative and explanatory parts of PolyPhy&amp;rsquo;s documentation.&lt;/li>
&lt;li>Work with the visual designer (see above) in designing a coherent story.&lt;/li>
&lt;/ul>
&lt;h3 id="video-tutorials-and-presentation-for-polyphy">Video tutorials and presentation for PolyPhy&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Video Presentation&lt;/code> &lt;code>Tutorials&lt;/code> &lt;code>Didactics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> video editing, creating educational content, communication, (native or fluent in another language)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy-Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>, &lt;a href="mailto:deehrlic@ucsc.edu">Drew Ehrlich&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Create a public face for PolyPhy that reflects its history and context, and teaches its functionality to users with different degrees of familiarity.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context and history of the project.&lt;/li>
&lt;li>Interview diverse project contributors.&lt;/li>
&lt;li>Create a video documenting PolyPhy&amp;rsquo;s history, with roots in astronomy, complex systems, fractals.&lt;/li>
&lt;li>Create a set of tutorial videos for starting and intermediate PolyPhy users.&lt;/li>
&lt;li>Create an accessible template for future tutorials.&lt;/li>
&lt;/ul>
&lt;h3 id="implement-heterogeneous-data-io-ops">Implement heterogeneous data I/O ops&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O Operations&lt;/code> &lt;code>File Conversion&lt;/code> &lt;code>Numerics&lt;/code> &lt;code>Testing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, experience working with scientific or statistical data, good debugging skills&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate-Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>, &lt;a href="mailto:anishagoel14@gmail.com">Anisha Goel&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>By default, PolyPhy operates with an unordered set of points as an input and scalar fields (float ndarrays) as an output, but other formats are applicable as well. Design and implement interfaces to load and export different data formats (CSV, OBJ, HDF5, FITS&amp;hellip;) and modalities (points, meshes, density fields). The difficulty of the project can be scaled based on the contributor&amp;rsquo;s interest.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Research which modalities are used by members of the target communities.&lt;/li>
&lt;li>Implement modular loaders for the inputs and an interface to PolyPhy core.&lt;/li>
&lt;li>Implement exporters for simulation datasets and visualization captures.&lt;/li>
&lt;li>Write testing code for the above.&lt;/li>
&lt;li>Integrate external packages as necessary.&lt;/li>
&lt;/ul>
&lt;h3 id="setup-cicd-for-polyphy">Set up CI/CD for PolyPhy&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Continuous Integration&lt;/code> &lt;code>Continuous Deployment&lt;/code> &lt;code>DevOps&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> experience with CI/CD, GitHub, Python package deployment&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>, &lt;a href="mailto:anishagoel14@gmail.com">Anisha Goel&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The objective is to set up a CI/CD pipeline that automates the building, testing, and deployment of the software. The resulting process needs to be robust to contributor errors and work in the distributed conditions of a diverse contributor base.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Automate continuous building, testing, merging and deployment for PolyPhy in GitHub.&lt;/li>
&lt;li>Publish the CI/CD metrics and build assets to the project webpage.&lt;/li>
&lt;li>Work with other contributors in educating them about the best practices of using the developed CI/CD pipeline.&lt;/li>
&lt;li>Add support for automated packaging using common management systems (pip, Anaconda).&lt;/li>
&lt;/ul>
&lt;h3 id="refine-polyphys-ui-and-develop-new-functional-elements">Refine PolyPhy&amp;rsquo;s UI and develop new functional elements&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>UI/UX&lt;/code> &lt;code>Visual Experience&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python programming, UI/UX development experience, (knowledge of graphics)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>, &lt;a href="mailto:dabramov@ucsc.edu">David Abramov&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The key feature of PolyPhy is its interactivity. By interacting with the underlying simulation model, the user can adjust its parameters in real time and respond to its behavior. For instance, an astrophysics expert can load a dataset of 100k galaxies and reconstruct the large-scale structure of the intergalactic medium. A responsive UI combined with real-time visualization allows them to judge the fidelity of the reconstruction and make necessary changes.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Implement a platform-agnostic UI to house PolyPhy&amp;rsquo;s main rendering context as well as secondary analytics.&lt;/li>
&lt;li>Work with the visualization developer (see below) to integrate the rendering functionality.&lt;/li>
&lt;li>Optimize the UI&amp;rsquo;s performance.&lt;/li>
&lt;li>Test the implementation on different OS platforms.&lt;/li>
&lt;/ul>
&lt;h3 id="create-new-data-visualization-regimes">Create new data visualization regimes&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Interactive Visualization&lt;/code> &lt;code>Data Analytics&lt;/code> &lt;code>3D Rendering&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> basic graphics theory and math, Python, GPU programming, (previous experience visualizing novel datasets)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>, &lt;a href="mailto:dabramov@ucsc.edu">David Abramov&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Data visualization is one of the core components of PolyPhy, as it provides a real-time overview of the underlying MCPM simulation. Through the feedback provided by the visualization, PolyPhy users can adjust the simulation model and make new findings about the dataset. Various operations over the reconstructed data (e.g. spatial searching) as well as important statistical summaries also benefit from clear visual presentation.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Develop novel ways of visualizing scientific data in PolyPhy.&lt;/li>
&lt;li>Work with diverse data modalities - point clouds, graphs, scalar and vector fields.&lt;/li>
&lt;li>Add support for visualizing metadata, such as annotations and labels.&lt;/li>
&lt;li>Create UI elements for plotting statistical summaries computed in real-time.&lt;/li>
&lt;/ul>
&lt;h3 id="discrete-graph-extraction-from-simulated-scalar-fields">Discrete graph extraction from simulated scalar fields&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Graph Theory&lt;/code> &lt;code>Data Science&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> good understanding of discrete math and graph theory, Python, (GPU programming)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>, &lt;a href="mailto:farhasan@nmsu.edu">Farhanul Hasan&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop a custom method for graph extraction from scalar field data produced by PolyPhy. Because PolyPhy typically produces network-like structures, representing these structures as weighted discrete graphs is very useful for efficiently navigating the data. The most important property of this abstracted representation is that it preserves the topology of the base scalar field by following the field&amp;rsquo;s 1D ridges.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Become familiar with different algorithms for graph growing and skeleton extraction.&lt;/li>
&lt;li>Implement the most suitable method in PolyPhy, interpreting the source scalar field as a throughput (transport) network. The weights of the resulting graph need to reflect the source throughputs between the respective node locations.&lt;/li>
&lt;li>Implement common graph operations, such as hierarchical clustering and reduction, shortest-path search between two nodes, and range queries.&lt;/li>
&lt;li>Optimize the runtime of the implemented methods.&lt;/li>
&lt;li>Work with the visualization developer (see above) to visualize the resulting graphs.&lt;/li>
&lt;/ul></description></item><item><title>DirtViz 2.0 (2023)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/dirtviz/</link><pubDate>Mon, 07 Feb 2022 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/dirtviz/</guid><description>&lt;p>DirtViz is a project to visualize data collected from sensors deployed in sensor networks. We have deployed a number of sensors measuring quantities such as soil moisture, temperature, current, and voltage in outdoor settings. This project involves extending our existing visualization stack, DirtViz 1.0 (see GitHub), to version 2.0. The project goal is to create a full-fledged data visualization tool tailored to the types of data collected from embedded-systems sensor networks.&lt;/p>
&lt;h3 id="visualize-sensor-data">Visualize Sensor Data&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Data Visualization, Analytics&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> JavaScript, Python, Bash, web servers, Git, embedded systems&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy/Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:sonaderi@ucsc.edu">Sonia Naderi&lt;/a>, &lt;a href="mailto:sgtaylor@ucsc.edu">Stephen Taylor&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Refine our web-based visualization tools so that users can easily zoom in on date ranges, change axes, etc.&lt;/li>
&lt;li>Create a system that lets remote collaborators and citizen scientists upload their own data securely&lt;/li>
&lt;li>Craft an intuitive navigation system so that data from deployment sites around the world can be easily viewed&lt;/li>
&lt;li>Document the tool thoroughly for future maintenance&lt;/li>
&lt;li>If you are interested, we are also open to you investigating correlations between different data streams and doing self-directed data analysis&lt;/li>
&lt;/ul></description></item></channel></rss>