<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>osre23 | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/osre23/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/osre23/index.xml" rel="self" type="application/rss+xml"/><description>osre23</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Sun, 29 Oct 2023 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>osre23</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/osre23/</link></image><item><title>These 4 new features will change the way you use OpenROAD</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/</link><pubDate>Sun, 29 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Welcome to the final blog post for my GSoC’23! Once again, my name is
Jack, and I am working on OpenROAD, an open-source electronic design
automation project. OpenROAD is a fast-growing, leading open-source
foundational application for semiconductor digital design, as evidenced
by its consistent star growth since inception. You can check us out
at this &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD/" target="_blank" rel="noopener">link&lt;/a>.
Allow me to share the four significant contributions I made during this
GSoC project.&lt;/p>
&lt;p>&lt;a href="https://star-history.com/#The-OpenROAD-Project/OpenROAD&amp;amp;Date" target="_blank" rel="noopener">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://api.star-history.com/svg?repos=The-OpenROAD-Project/OpenROAD&amp;amp;type=Date" alt="Star History Chart" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/a>&lt;/p>
&lt;h2 id="1-improving-ease-of-installation">1) Improving Ease of Installation&lt;/h2>
&lt;p>Firstly, OpenROAD now supports multiple operating systems.
This is essential because one of our primary goals is to democratise chip
implementation, and installation is often one of the hardest steps
to get right, so it was one of our priorities. Today, we provide
several installation options:&lt;/p>
&lt;ul>
&lt;li>&lt;em>Prebuilt binaries&lt;/em>: Local builds can be riddled
with incompatibilities or unexpected bugs, and compilation takes a long
time. We sidestepped this by providing semi-regular updates to the
OpenROAD binaries, reducing the time to installation.&lt;/li>
&lt;li>&lt;em>Docker&lt;/em>: Echoing the concerns above, we also enabled Docker-based
installation, verified on 9 major operating systems. Docker is extremely
flexible and runs on any platform that Docker itself supports.&lt;/li>
&lt;/ul>
&lt;p>With these changes, we have observed a 10% reduction in installation-related GitHub issues posted on a weekly basis.&lt;/p>
&lt;p>
&lt;figure id="figure-figure-1-supported-os-matrix">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic1" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic1_hu40f387e99db4aa81085a02b3bc75ebae_22326_5ec6a03672875da1d114ed8b24e54d81.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic1_hu40f387e99db4aa81085a02b3bc75ebae_22326_256594bafdfffa842322c55b991f1ae1.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic1_hu40f387e99db4aa81085a02b3bc75ebae_22326_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic1_hu40f387e99db4aa81085a02b3bc75ebae_22326_5ec6a03672875da1d114ed8b24e54d81.webp"
width="650"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 1: Supported OS matrix
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h2 id="2-filling-missing-documentation">2) Filling Missing Documentation&lt;/h2>
&lt;p>Next, we made considerable improvements to over 20 tool-specific
documentation pages, introducing a consistent formatting style for each
page. We added default values and datatypes so that users can apply the
tools with greater ease.&lt;/p>
&lt;p>
&lt;figure id="figure-figure-2-helpful-documentation-defaults-and-datatype">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic2" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic2_hu909e40a774da931354132b6c4f3b2165_22459_f20854090d02e2c8c4eab994e275b52a.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic2_hu909e40a774da931354132b6c4f3b2165_22459_2d201fd5ada34b46714b076a84194e28.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic2_hu909e40a774da931354132b6c4f3b2165_22459_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic2_hu909e40a774da931354132b6c4f3b2165_22459_f20854090d02e2c8c4eab994e275b52a.webp"
width="691"
height="368"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 2: Helpful documentation defaults and datatype
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Rather than listing all arguments for a command in one common table,
we separated them into developer arguments and developer commands.
This makes our documentation more beginner-friendly to read,
while not alienating our technical user base. We have also added sections
for example scripts and regression tests, to help onboard
newcomers to each tool in the flow.&lt;/p>
&lt;p>
&lt;figure id="figure-figure-3-useful-developer-commands-example-scripts-and-regression-test-instructions">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic3" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic3_huf8b2e6da7ee6998c3390f4691d0458af_30285_e3fcd088f5df4574a67cf6d097c9e73a.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic3_huf8b2e6da7ee6998c3390f4691d0458af_30285_1ceeb7f590547f00904c173b5a084798.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic3_huf8b2e6da7ee6998c3390f4691d0458af_30285_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic3_huf8b2e6da7ee6998c3390f4691d0458af_30285_e3fcd088f5df4574a67cf6d097c9e73a.webp"
width="690"
height="670"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 3: Useful developer commands, example scripts, and regression test instructions
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h2 id="3-extensible-documentation-framework">3) Extensible Documentation Framework&lt;/h2>
&lt;p>Thirdly, we have introduced an extensible documentation framework.
Now, what do we mean by &lt;em>extensible&lt;/em>? It means we have created
infrastructure that is easy for developers to use and allows for
greater maintainability. Our goal is a setup that requires minimal
changes to add new documentation content.&lt;/p>
&lt;p>So, how did we do this?&lt;/p>
&lt;p>We introduced four initiatives. The first is the warning/error
message glossary. We noticed that people were searching for error and
warning messages, but our documentation did not include them. So we added
a page where all the error/warning messages, along with the relevant
source code line numbers, are generated automatically. On top of that,
developers can add useful debug information to help the end user.&lt;/p>
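In spirit, the glossary generation works by scanning the source tree for logging calls and tabulating them. The sketch below is a hypothetical illustration, not the actual OpenROAD tooling; in particular, the `logger->warn("TOOL", id, "message")` call pattern is an assumption:

```python
import re
from pathlib import Path

# Assumed pattern for calls such as: logger->warn("TAP", 12, "row overlap");
MSG_RE = re.compile(r'logger->(warn|error)\(\s*"(\w+)",\s*(\d+),\s*"([^"]*)"')

def build_glossary(root: str) -> list[dict]:
    """Scan C++ sources under `root` and collect warning/error messages
    with their file and line number, sorted by tool and message id."""
    entries = []
    for path in Path(root).rglob("*.cpp"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            m = MSG_RE.search(line)
            if m:
                level, tool, msg_id, text = m.groups()
                entries.append({
                    "level": level, "tool": tool, "id": int(msg_id),
                    "message": text, "file": path.name, "line": lineno,
                })
    return sorted(entries, key=lambda e: (e["tool"], e["id"]))
```

Running such a scan in CI keeps the glossary page in sync with the source automatically, which is exactly the "minimal changes to add content" property described above.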
&lt;p>
&lt;figure id="figure-figure-4-warningerror-messages-glossary">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic4" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic4_hu4de9242319e92c4f80050403ede9a5eb_17089_aa069c4f5f2d1682fc92525139f6d57c.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic4_hu4de9242319e92c4f80050403ede9a5eb_17089_881f56c79ec21ee86b422f9eb12ef3c8.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic4_hu4de9242319e92c4f80050403ede9a5eb_17089_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic4_hu4de9242319e92c4f80050403ede9a5eb_17089_aa069c4f5f2d1682fc92525139f6d57c.webp"
width="687"
height="348"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 4: Warning/Error messages glossary.
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Next, we also introduced automatically generated Doxygen pages, which
integrate nicely with our C++/Tcl source code. With this automatic
generation, developers simply insert comments into their source code
and let Doxygen produce the documentation.&lt;/p>
&lt;p>
&lt;figure id="figure-figure-5-doxygen-pages">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic5" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic5_hu5bbe3008d2202e9240368dd966dc7b39_37072_567ad1b2725278073bfe8cdf4d2dad6a.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic5_hu5bbe3008d2202e9240368dd966dc7b39_37072_35b25ed8006816a0cd300dba6aedb4a3.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic5_hu5bbe3008d2202e9240368dd966dc7b39_37072_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic5_hu5bbe3008d2202e9240368dd966dc7b39_37072_567ad1b2725278073bfe8cdf4d2dad6a.webp"
width="760"
height="578"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 5: Doxygen pages.
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Next, we introduced cloud-based packaging. It is important that our
framework can run in the cloud and in the ever-popular notebook
format. Our Colab-based notebook was created with this in mind, and it
can be ported to other notebook providers with some modifications.
Check out the notebooks here!&lt;/p>
&lt;p>
&lt;figure id="figure-figure-6-google-colab-can-now-run-openroad-scripts">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic6" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic6_hu84acc4eba83f1de30ea399aa678d63ae_48463_0f20b3a36a05036a4602868c18f0da9b.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic6_hu84acc4eba83f1de30ea399aa678d63ae_48463_125685c82e5be8372c2ae4b937fdd412.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic6_hu84acc4eba83f1de30ea399aa678d63ae_48463_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic6_hu84acc4eba83f1de30ea399aa678d63ae_48463_0f20b3a36a05036a4602868c18f0da9b.webp"
width="760"
height="321"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 6: Google Colab can now run OpenROAD scripts.
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Lastly, we added a changelog workflow that can be triggered manually.
For our open-source project, we have chosen not to do software releases,
which makes it difficult to track the changes between commits. This
workflow helps newcomers track changes more easily, grouped by
month.&lt;/p>
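The core of such a by-month grouping could be sketched as below. This is an illustrative stand-in, not the actual GitHub Actions workflow, and the sample commit subjects are made up:

```python
from collections import defaultdict
from datetime import date

def changelog_by_month(commits):
    """Group (date, subject) commit pairs into a per-month changelog.

    `commits` is an iterable of (datetime.date, str) pairs, e.g. parsed
    from `git log --pretty='%as %s'`. Returns {"YYYY-MM": [subjects]}.
    """
    grouped = defaultdict(list)
    for day, subject in commits:
        grouped[day.strftime("%Y-%m")].append(subject)
    return dict(grouped)

# Hypothetical commit history for illustration.
commits = [
    (date(2023, 8, 3), "gui: fix crash on zoom"),
    (date(2023, 8, 19), "docs: add regression test section"),
    (date(2023, 9, 1), "drt: speed up routing"),
]
```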
&lt;p>
&lt;figure id="figure-figure-7-sample-output-of-github-changelog">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic7" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic7_hu4e7dae0ef8916646279c834f2bbbed59_40244_a13d29d9b1d8fe53307365f5dfd84d86.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic7_hu4e7dae0ef8916646279c834f2bbbed59_40244_9baeb333eb95f59c9ac1004e0e9fd54c.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic7_hu4e7dae0ef8916646279c834f2bbbed59_40244_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic7_hu4e7dae0ef8916646279c834f2bbbed59_40244_a13d29d9b1d8fe53307365f5dfd84d86.webp"
width="760"
height="400"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 7: Sample output of github changelog
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h2 id="4-openroad-chatbot">4) OpenROAD Chatbot&lt;/h2>
&lt;p>Finally, we are also discussing the potential of creating a chatbot to
answer user queries. There is a lot of domain knowledge scattered across
Slack channels, GitHub repositories, and so on, so why not build an
LLM-based chatbot on top of it? Stay tuned for updates!&lt;/p>
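As a rough illustration of the retrieval side of such a chatbot, the toy sketch below ranks documents by naive keyword overlap. A real LLM-based assistant would use embedding search plus generation; every document name and snippet here is hypothetical:

```python
def retrieve(query: str, documents: dict, top_k: int = 2):
    """Rank documents by how many query words appear in them.
    A toy stand-in for the embedding-based retrieval an LLM chatbot
    would pair with answer generation."""
    words = set(query.lower().split())
    scored = []
    for name, text in documents.items():
        score = sum(1 for w in words if w in text.lower())
        scored.append((score, name))
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

# Hypothetical knowledge base scraped from docs/Slack/GitHub.
docs = {
    "install.md": "Install OpenROAD via prebuilt binaries or Docker.",
    "gpl.md": "Global placement arguments and developer commands.",
}
```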
&lt;h2 id="personal-reflections">Personal Reflections&lt;/h2>
&lt;p>To me, the most valuable takeaway concerns code quality. Oftentimes,
we as coders tend to opt for the quickest working solution and “hack” something
out. Hacking is fine as a proof of concept, but not for
long-term code development. Working in an open-source project like this,
I have learnt to avoid creating unnecessary files, to shorten code,
and to optimise runtime. In doing our job, we also wish to make life
easier, not harder, for future developers.&lt;/p>
&lt;h2 id="final-words">Final Words&lt;/h2>
&lt;p>I would like to express my gratitude to my mentors Indira and Vitor for
their guidance and insight throughout the project, as well as to the
OpenROAD dev team for their assistance. I would also like to thank the
Google Summer of Code organising committee and UCSC for creating such a
wonderful program. Being able to contribute to real open-source
projects with real needs is truly the best of both worlds for aspiring
programmers.&lt;/p></description></item><item><title>Final Blog Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20231030-ren.450/</link><pubDate>Wed, 25 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20231030-ren.450/</guid><description>&lt;p>In my final blog, I will first introduce the project, then describe the achievements after the midterm and summarize our experiments. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings">Measuring Research Prototypes under Unreported Settings&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1ouFre-qMDCL_LiH5jFNUCOI1yAYHdWcS/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/miao-yu/">Miao YU&lt;/a>, aims to understand the impact of missing settings in artifact evaluation.&lt;/p>
&lt;p>In my &lt;a href="/report/osre23/osu/missingsettings/20230802-ren.450/">midterm blog&lt;/a>, I used three parameters in the PostgreSQL configuration to test TPC-C benchmark performance and obtained some initial results on how each parameter separately affects throughput. After the midterm, I continued the experiments on four parameters (shared_buffers, min_wal_size, max_wal_size, and effective_cache_size) with more values, combining them to measure their effect on performance. These parameters relate to memory consumption, checkpoints, and planner cost in the database server. You can refer to my previous blog for details.&lt;/p>
&lt;p>For the experiment, we continue to measure benchmark throughput by setting the scale factor to 10 and incrementing the number of worker terminals. All database server settings are defaults except the four parameters we tune. For shared_buffers, we choose six values, from the initial 128MB up to 8GB. Then, for each shared_buffers setting, effective_cache_size takes three values, from the initial 4GB up to 16GB. Next, for each effective_cache_size setting, we tune min_wal_size and max_wal_size as a tuple; min_wal_size takes two values and max_wal_size takes four. We run three rounds for each setting and report the average of the three throughput numbers.&lt;/p>
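The sweep described above amounts to a Cartesian grid over the four parameters. The sketch below generates such a grid; the concrete values are placeholders consistent with the ranges mentioned, not the exact values used in our runs:

```python
from itertools import product

# Placeholder values spanning the ranges described in the text.
shared_buffers = ["128MB", "256MB", "512MB", "1GB", "4GB", "8GB"]  # 6 values
effective_cache_size = ["4GB", "8GB", "16GB"]                      # 3 values
min_wal_size = ["80MB", "1GB"]                                     # 2 values
max_wal_size = ["1GB", "2GB", "4GB", "8GB"]                        # 4 values

def config_grid():
    """Yield one PostgreSQL configuration dict per combination to benchmark."""
    for sb, ecs, mn, mx in product(shared_buffers, effective_cache_size,
                                   min_wal_size, max_wal_size):
        yield {"shared_buffers": sb, "effective_cache_size": ecs,
               "min_wal_size": mn, "max_wal_size": mx}

configs = list(config_grid())
```

With three benchmark rounds per configuration, the cost of a full grid grows multiplicatively, which is why sampling a subset of settings is attractive.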
&lt;p>Based on the &lt;a href="https://docs.google.com/spreadsheets/d/12OeSwZGq2G4-YGY5BTH5uZbVcAaxcZqYhqciCaBiF2E/edit?usp=sharing" target="_blank" rel="noopener">results&lt;/a>, the observations are consistent with the conclusions from the midterm blog. Benchmark throughput can be improved by tuning shared_buffers and max_wal_size, while effective_cache_size and min_wal_size have no obvious effect for this benchmark. The improvement plateaus once shared_buffers and max_wal_size reach a certain value.&lt;/p>
&lt;p>In our experiment, we chose only four parameters for one benchmark. The experiment is expensive given how time-consuming it is, and there are more values of the above-mentioned parameters left to test. This experiment also indicates that we may need to sample a subset of settings to generate observations that match those from a full, extensive artifact evaluation.&lt;/p></description></item><item><title>Public Artifact and Data Visualization: A Journey to Empower</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/</link><pubDate>Tue, 24 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/</guid><description>&lt;p>Hello, friends!
As we draw the curtains on our project, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/intel/artifactviz">Public Artifact and Data Visualization&lt;/a>, we&amp;rsquo;re thrilled to present the advancements we&amp;rsquo;ve achieved since our mid-term update. Our mission has been to foster a deeper understanding of data and empower users to make informed decisions. Let&amp;rsquo;s delve into the evolution of our project.&lt;/p>
&lt;h2 id="unveiling-new-functionalities">Unveiling New Functionalities&lt;/h2>
&lt;ol>
&lt;li>Modular Architecture: Your Way, Your Choice&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>At the core of our project is a modular architecture designed to cater to your unique preferences. We firmly believe that choice empowers users. Thus, we&amp;rsquo;ve given you the option to select between a Graphical User Interface (GUI) and a Command-Line Interface (CLI). It&amp;rsquo;s about providing a platform that adapts to your specific requirements and style of interaction.&lt;/li>
&lt;/ul>
&lt;ol start="2">
&lt;li>Real-time Backend Environment Monitoring: Data as it Happens&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>Real-time monitoring of backend environment data is at the heart of our project. It&amp;rsquo;s not just about collecting data; it&amp;rsquo;s about providing continuous insights into system performance. This feature empowers you to make real-time, data-driven decisions—an essential capability in today&amp;rsquo;s fast-paced computing landscape.&lt;/li>
&lt;/ul>
&lt;ol start="3">
&lt;li>Visualizing Environment Variables: Clarity Amidst Complexity&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>We&amp;rsquo;ve placed a strong emphasis on user-friendly data visualization. Our enhancements enable you to navigate through detected variables effortlessly and compare iterations within different buckets. The result is a visual representation of complex data, making it easier to comprehend and analyze.&lt;/li>
&lt;/ul>
&lt;ol start="4">
&lt;li>Predefined Monitoring Commands: Your Head Start&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>We understand that monitoring can be a daunting task. To simplify the process, we&amp;rsquo;ve introduced predefined monitoring commands such as mpstat and iostat. These templates serve as a launchpad for monitoring common system metrics, helping you get started quickly and efficiently.&lt;/li>
&lt;/ul>
&lt;ol start="5">
&lt;li>Comprehensive Customization: Tailoring the Experience&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>Recognizing that every user has unique needs, our platform now offers extensive documentation. This documentation serves as a guide, enabling users to fine-tune their monitoring commands. It&amp;rsquo;s about tailoring the platform to match your specific requirements and preferences. The power to customize is firmly in your hands.&lt;/li>
&lt;/ul>
&lt;ol start="6">
&lt;li>Import and Export Functionality: Seamless Collaboration&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>In an era where collaboration and data management are essential, we&amp;rsquo;ve introduced the capability to import and export environment data. This feature simplifies data management and supports collaborative efforts, making it easy to share monitoring data and conduct analysis across various environments.&lt;/li>
&lt;/ul>
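The real-time monitoring described above boils down to sampling a metric source on a fixed interval. The sketch below is an illustrative stand-in for the platform's actual implementation; in practice `sample_fn` would wrap a command such as `mpstat` or `iostat` run via subprocess:

```python
import time

def monitor(sample_fn, interval_s: float, n_samples: int):
    """Collect `n_samples` readings from `sample_fn`, one every
    `interval_s` seconds, timestamping each reading. `sample_fn`
    stands in for a parsed system command like mpstat or iostat."""
    readings = []
    for _ in range(n_samples):
        readings.append({"t": time.time(), "value": sample_fn()})
        time.sleep(interval_s)
    return readings

# Example: a fake CPU-usage source standing in for parsed mpstat output.
fake_cpu = iter([12.5, 30.1, 22.4])
samples = monitor(lambda: next(fake_cpu), interval_s=0.01, n_samples=3)
```

The list of timestamped dicts maps directly onto the import/export and visualization features: it can be serialized as-is, or plotted as value vs. time.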
&lt;h2 id="exploring-our-repositories">Exploring Our Repositories&lt;/h2>
&lt;p>
As mentioned earlier, we have completed the core functionalities of our platform, and we would love for you to try it out and give us valuable feedback. Here are the links to the repositories where you can explore and experiment with our platform:
&lt;/p>
&lt;ol>
&lt;li>&lt;a href="https://github.com/PublicExperimentDatabase/PublicExperimentGUI" target="_blank" rel="noopener">GUI Repository&lt;/a> and &lt;a href="https://github.com/PublicExperimentDatabase/PublicExperimentCLI" target="_blank" rel="noopener">CLI Repository&lt;/a>
&lt;ul>
&lt;li>The journey begins with a choice. Our repositories cater to a diverse range of user preferences. The README.md file of the GUI repository contains detailed installation instructions to guide you through setting up the Graphical User Interface (GUI). It&amp;rsquo;s your portal to a user-friendly experience.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://github.com/PublicExperimentDatabase/test-experiment" target="_blank" rel="noopener">Sample Repository&lt;/a>
&lt;ul>
&lt;li>For those eager to embark on their monitoring journey, our Sample Repository is a valuable resource. It provides scripts that not only let you run our program but also serve as templates, designed to simplify monitoring your own programs, tailored to your unique requirements.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h2 id="project-demo">Project Demo&lt;/h2>
&lt;p>
To give you a glimpse of what our project can do, here are some demo images showcasing the capabilities and features of &amp;ldquo;Public Artifact and Data Visualization.&amp;rdquo;
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature1_hub6eb130c638c788b954d77fd05b17dc2_80420_39e93d5df25c8b9261ed5b60f3a49091.webp 400w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature1_hub6eb130c638c788b954d77fd05b17dc2_80420_435c9e662168ef7e029d1c36702fca84.webp 760w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature1_hub6eb130c638c788b954d77fd05b17dc2_80420_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature1_hub6eb130c638c788b954d77fd05b17dc2_80420_39e93d5df25c8b9261ed5b60f3a49091.webp"
width="760"
height="396"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature2_hu70b57a5e19005e11cc3a42881b456609_84702_df590742e12a23dea8d1f3414c9e5c16.webp 400w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature2_hu70b57a5e19005e11cc3a42881b456609_84702_b47182cd4c3ea07108c723e7c18875e4.webp 760w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature2_hu70b57a5e19005e11cc3a42881b456609_84702_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature2_hu70b57a5e19005e11cc3a42881b456609_84702_df590742e12a23dea8d1f3414c9e5c16.webp"
width="760"
height="396"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature3_hu17639a210c97ec1be7d726068aef2aa2_44169_6117bb9125bca9a4f63ad1631b5f7bcc.webp 400w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature3_hu17639a210c97ec1be7d726068aef2aa2_44169_3979a5588d47e6a37a482b5f2184d3af.webp 760w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature3_hu17639a210c97ec1be7d726068aef2aa2_44169_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature3_hu17639a210c97ec1be7d726068aef2aa2_44169_6117bb9125bca9a4f63ad1631b5f7bcc.webp"
width="736"
height="656"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="thank-you-for-joining-us">Thank You for Joining Us&lt;/h2>
&lt;p>We appreciate your support and participation in this journey of data visualization and empowerment. Our commitment to enhancing the world of data comprehension remains unwavering. As we mark the end of this chapter, we eagerly anticipate the exciting future that awaits in the realm of data visualization. The path doesn&amp;rsquo;t end here; it&amp;rsquo;s just the beginning of a new chapter in our collective exploration of data&amp;rsquo;s potential.
​&lt;/p></description></item><item><title>Final Blog on Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230820-srishti-j18/</link><pubDate>Fri, 20 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230820-srishti-j18/</guid><description>&lt;p>Hello Again!&lt;/p>
&lt;p>I&amp;rsquo;m excited to present my final blog post summarizing the progress and achievements made over the 2023 Summer of Reproducibility Fellowship. I will be sharing the work I&amp;rsquo;ve created for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet">Teaching Computer Networks with Reproducible Research: Developing a &amp;lsquo;classroom competition&amp;rsquo; for adaptive video delivery&lt;/a> project.&lt;/p>
&lt;h2 id="recap-of-the-journey">Recap of the Journey&lt;/h2>
&lt;p>In my &lt;a href="content/report/osre23/nyu/edunet/20230801-Srishti-j18">mid-term&lt;/a> evaluation, I discussed the initial milestones and challenges I encountered during this program. At that point, I studied the key figures from the research paper &amp;lsquo;&lt;a href="https://dl.acm.org/doi/10.1145/2491172.2491179" target="_blank" rel="noopener">Downton Abbey Without the Hiccups: Buffer-Based Rate Adaptation for HTTP Video Streaming&lt;/a>&amp;rsquo;. My primary objectives were to ensure compatibility with both Python 2 and Python 3 and to incorporate an &amp;lsquo;Estimated Download Rate&amp;rsquo; metric into the output file generated by the adaptive video client. Furthermore, I expanded the project to include two crucial visualizations: buffer occupancy vs. time and estimated download rate vs. time.&lt;/p>
&lt;h2 id="final-project-progress">Final Project Progress&lt;/h2>
&lt;p>In the final weeks of my internship, I worked towards my ultimate goal: to reproduce existing work and create a clear guide that future students can build upon and improve. To achieve this, I created a new experiment based on an existing one,&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/edunet/20230820-srishti-j18/feature1_hu9d52f6d5cbdd1ece23828e42c8b71316_147352_f279db1f4805fb171d3cff4ae4a908dc.webp 400w,
/report/osre23/nyu/edunet/20230820-srishti-j18/feature1_hu9d52f6d5cbdd1ece23828e42c8b71316_147352_9e5cf37b8721460bee97304092b3b9fa.webp 760w,
/report/osre23/nyu/edunet/20230820-srishti-j18/feature1_hu9d52f6d5cbdd1ece23828e42c8b71316_147352_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230820-srishti-j18/feature1_hu9d52f6d5cbdd1ece23828e42c8b71316_147352_f279db1f4805fb171d3cff4ae4a908dc.webp"
width="760"
height="442"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>which I titled &amp;ldquo;&lt;a href="https://github.com/Srishti-j18/adaptive-video/blob/68bd537a65eeec0f221ae095b35b18c1e8ffd2ef//notebooks/exec_policy.ipynb" target="_blank" rel="noopener">Compare Adaptive Video Policies&lt;/a>&amp;rdquo;&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/edunet/20230820-srishti-j18/featured_hu28a6c98f585340505adb453b1827e333_184681_9db9d6a3e27e1f9a70c791dbc5fb72d7.webp 400w,
/report/osre23/nyu/edunet/20230820-srishti-j18/featured_hu28a6c98f585340505adb453b1827e333_184681_12adfaacac4f310f07b71c83727dd13e.webp 760w,
/report/osre23/nyu/edunet/20230820-srishti-j18/featured_hu28a6c98f585340505adb453b1827e333_184681_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230820-srishti-j18/featured_hu28a6c98f585340505adb453b1827e333_184681_9db9d6a3e27e1f9a70c791dbc5fb72d7.webp"
width="760"
height="575"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>This experiment compares two policies: a rate-based (basic) policy and a
buffer-based (Netflix) policy. In the experiment, I covered the following key aspects:&lt;/p>
&lt;p>How Both Policies Work: I detailed the workings of both the rate-based and buffer-based policies, explaining how each policy selects the next bitrate, among other relevant information.&lt;/p>
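In simplified form, the two selection rules can be sketched as follows. The bitrate ladder, safety factor, reservoir, and cushion values here are illustrative placeholders, not the values from the paper or the experiment:

```python
BITRATES_KBPS = [235, 375, 560, 750, 1050, 1750, 2350, 3000]  # illustrative ladder

def rate_based_next(estimated_rate_kbps: float) -> int:
    """Rate-based (basic) policy: pick the highest bitrate at or below a
    conservative fraction of the estimated download rate."""
    safe = 0.8 * estimated_rate_kbps  # 0.8 is an assumed safety factor
    candidates = [b for b in BITRATES_KBPS if b <= safe]
    return candidates[-1] if candidates else BITRATES_KBPS[0]

def buffer_based_next(buffer_s: float, reservoir_s: float = 10.0,
                      cushion_s: float = 30.0) -> int:
    """Buffer-based (Netflix-style) policy: map buffer occupancy linearly
    onto the bitrate ladder between a reservoir and a cushion."""
    if buffer_s <= reservoir_s:
        return BITRATES_KBPS[0]
    if buffer_s >= cushion_s:
        return BITRATES_KBPS[-1]
    frac = (buffer_s - reservoir_s) / (cushion_s - reservoir_s)
    return BITRATES_KBPS[int(frac * (len(BITRATES_KBPS) - 1))]
```

The contrast is visible directly: the rate-based rule reacts to the (noisy) throughput estimate, while the buffer-based rule reacts only to how much video is already buffered, which is why it rides out short interruptions more smoothly.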
&lt;p>Instructions for Execution of Policies: After conducting several experiments with different settings, I determined the most appropriate settings for this experiment and added them to the instructions for executing both policies. The focus is on ensuring similar &amp;ldquo;high&amp;rdquo; network rates, similar &amp;ldquo;low&amp;rdquo; data rates, similar durations of the &amp;ldquo;high&amp;rdquo; data rate before the
interruption, and similar durations of the &amp;ldquo;interruption.&amp;rdquo; This setup allows for an easy and clear comparison of the two policies.&lt;/p>
&lt;p>Discussion Part: In the discussion section, I addressed the differences that students can observe after conducting the experiment and visualising the graphs and videos.&lt;/p>
&lt;p>In conclusion, I would like to thank my mentor, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>, who has given me excellent guidance, and I would like to express my gratitude to OSRE23, where I have learned so much. This experience has been amazing for my personal and professional growth.&lt;/p></description></item><item><title>Final Blog on Using Reproducibility in Machine Learning Education</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/</link><pubDate>Wed, 18 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/</guid><description>&lt;p>Welcome back!&lt;/p>
&lt;p>In my final blog post for the 2023 Summer of Reproducibility Fellowship, I&amp;rsquo;ll be sharing my experiences and the materials I&amp;rsquo;ve created for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education project&lt;/a>. As a quick reminder, my mentor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and I have been working on developing interactive open-source educational resources that teach reproducibility and reproducible research in machine learning. You can find my &lt;a href="https://drive.google.com/file/d/13HnCMZawpabiLdBoOiaJFF2mNXIPLCVJ/view?usp=sharing" target="_blank" rel="noopener">proposal here&lt;/a>.&lt;/p>
&lt;p>In this post, I&amp;rsquo;ll give you a rundown of my experience and share the materials I&amp;rsquo;ve created. If you haven&amp;rsquo;t checked out my previous blog posts, definitely take a look before diving into this one. Let&amp;rsquo;s get started!&lt;/p>
&lt;h2 id="why-is-this-project-important-">Why is this project important 🤔&lt;/h2>
&lt;p>Reproducibility is an essential aspect of scientific research, and it&amp;rsquo;s becoming increasingly important in the field of computer science. However, most efforts to promote reproducibility in education focus on students who are actively involved in research, leaving a significant gap in the curriculum for introductory courses. Our project aims to address this issue by incorporating reproducibility experiences into machine learning education.&lt;/p>
&lt;h2 id="why-reproducibility-matters-in-education-">Why Reproducibility Matters in Education 🎓&lt;/h2>
&lt;p>There are two primary reasons why we believe reproducibility belongs in the computer science classroom. Firstly, it allows students to experience the process of reproducing research firsthand, giving them a deeper understanding of the scientific method and its importance in the field. This exposure can inspire students to adopt reproducible practices in their future careers, contributing to a more transparent and reliable scientific community.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231018-msaeed/reproducibilityBenifits_hudb7baafb83412fd51973fc577da0863d_141778_0f4e9e6ba00e070430ccd90e09800a28.webp 400w,
/report/osre23/nyu/eduml/20231018-msaeed/reproducibilityBenifits_hudb7baafb83412fd51973fc577da0863d_141778_22ad37d3ef94bfc2aa93cf4ba651684e.webp 760w,
/report/osre23/nyu/eduml/20231018-msaeed/reproducibilityBenifits_hudb7baafb83412fd51973fc577da0863d_141778_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/reproducibilityBenifits_hudb7baafb83412fd51973fc577da0863d_141778_0f4e9e6ba00e070430ccd90e09800a28.webp"
width="760"
height="207"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Source: Fund, Fraida. &amp;ldquo;We Need More Reproducibility Content Across the Computer Science Curriculum.&amp;rdquo; Proceedings of the 2023 ACM Conference on Reproducibility and Replicability. 2023.&lt;/em>&lt;/p>
&lt;p>Secondly, as shown in the figure, involving students in reproducibility efforts can have a significant impact on the reproducibility ecosystem itself. Students can create reproducibility artifacts, such as replicable experiments or data analysis, that can be used by other researchers, including authors and graduate students. Additionally, students can consume reproducibility artifacts created by the research community, provide feedback, and suggest improvements. Authors appreciate this type of engagement, as it adds value to their work and promotes open science.&lt;/p>
&lt;h2 id="focusing-on-machine-learning-">Focusing on Machine Learning 🧐&lt;/h2>
&lt;p>Given the growing interest in machine learning and its relevance to reproducibility, our project decided to focus on this area. Machine learning already has a strong culture of reproducibility, with initiatives like &lt;a href="https://paperswithcode.com/" target="_blank" rel="noopener">Papers with Code&lt;/a> and the &lt;a href="https://paperswithcode.com/rc2022" target="_blank" rel="noopener">ML Reproducibility Challenge&lt;/a>. These efforts encourage researchers to share their code and reproduce recent machine learning papers, validating their results. By leveraging these existing resources, we can create learning materials that utilize real-world examples and foster hands-on reproducibility experiences for students.&lt;/p>
&lt;h2 id="the-interactive-notebooks-">The Interactive Notebooks 📖&lt;/h2>
&lt;p>We have created two learning materials that focus on machine learning and reproducibility. &lt;strong>The first material&lt;/strong> looks at a paper titled &lt;a href="https://arxiv.org/abs/1910.08475" target="_blank" rel="noopener">&amp;ldquo;On Warm Starting Neural Network Training&amp;rdquo;&lt;/a> by Jordan T. Ash and Ryan P. Adams. This paper discusses the concept of warm-starting, which involves using weights from a model previously trained on a subset of the dataset to initialize training of a new model. The authors compare the performance of warm-started models with randomly initialized models and find that the warm-started models perform worse, as shown in the figure below.&lt;/p>
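&lt;p>The warm-starting setup can be sketched on a toy problem. This is a minimal illustration of the idea only, not the authors&amp;rsquo; experiment: a simple logistic-regression model (the data and names here are made up) is first trained on half the data, and its weights are reused to initialize training on the full dataset, alongside a freshly initialized baseline. On a small convex problem like this the warm-started model will not necessarily do worse; the paper&amp;rsquo;s effect concerns deep neural networks.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

def train_logreg(X, y, w0, steps=200, lr=0.1):
    """Plain gradient-descent logistic regression, starting from w0."""
    w = w0.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)      # gradient step
    return w

# Toy data: 200 points, 5 features, linearly separable labels.
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)

# Warm start: weights trained on the first half of the data...
w_half = train_logreg(X[:100], y[:100], np.zeros(5))
# ...are reused to initialize training on the full dataset,
w_warm = train_logreg(X, y, w_half)
# while the baseline starts from a fresh (here: zero) initialization.
w_cold = train_logreg(X, y, np.zeros(5))

def accuracy(w):
    return float(((X @ w > 0).astype(float) == y).mean())

print("warm:", accuracy(w_warm), "cold:", accuracy(w_cold))
```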
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231018-msaeed/figure1_hu9d945c7dbee9ef6ef608a89a33d817c5_76602_5815498dd015ebc84b00505c90a65354.webp 400w,
/report/osre23/nyu/eduml/20231018-msaeed/figure1_hu9d945c7dbee9ef6ef608a89a33d817c5_76602_df01d4772e731cee04ae4783ac0cc994.webp 760w,
/report/osre23/nyu/eduml/20231018-msaeed/figure1_hu9d945c7dbee9ef6ef608a89a33d817c5_76602_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/figure1_hu9d945c7dbee9ef6ef608a89a33d817c5_76602_5815498dd015ebc84b00505c90a65354.webp"
width="760"
height="306"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Our material takes students through the process of identifying the different claims made in the paper and finding the corresponding experiments that support them. They will also learn how to use open-source code and available data to reproduce these experiments and understand the computational complexity associated with reproducing each experiment. This material is available on both &lt;a href="https://github.com/mohammed183/re_warm_start_nn/tree/main" target="_blank" rel="noopener">GitHub&lt;/a> and &lt;a href="https://chameleoncloud.org/experiment/share/5b5717df-9aa9-470f-b393-c1e189c008a8" target="_blank" rel="noopener">Chameleon&lt;/a>, where Chameleon provides the resources needed to run the material.&lt;/p>
&lt;p>&lt;strong>The second material&lt;/strong> examines the paper &lt;a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noopener">&amp;ldquo;An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale&amp;rdquo;&lt;/a> by Dosovitskiy et al., which introduces a novel way of applying the transformer architecture, which was originally designed for natural language processing, to image recognition tasks. The paper shows that transformers can achieve state-of-the-art results on several image classification benchmarks, such as ImageNet, when trained on large-scale datasets as shown in the following table.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231018-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_7faf7451ff08ac87e7d12ab941c77f8e.webp 400w,
/report/osre23/nyu/eduml/20231018-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_5441e5e4c6ffed9b29244a3a3dcde852.webp 760w,
/report/osre23/nyu/eduml/20231018-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_7faf7451ff08ac87e7d12ab941c77f8e.webp"
width="760"
height="354"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Our material guides students through the process of understanding which claims can and cannot be validated based on the available datasets and how complex it can be to validate each claim. Additionally, they will learn how to use pre-trained models to replicate computationally expensive experiments. Again, this material can be found on both &lt;a href="https://github.com/mohammed183/re_vit/tree/main" target="_blank" rel="noopener">GitHub&lt;/a> and &lt;a href="https://chameleoncloud.org/experiment/share/8f0e34c5-d2c4-45be-8425-36686ad57650" target="_blank" rel="noopener">Chameleon&lt;/a>.&lt;/p>
&lt;p>Both materials are designed to be easy to understand and interactive, allowing students to engage with the content and gain a deeper understanding of the concepts. Instructors can use these materials to assess their students&amp;rsquo; understanding of machine learning and reproducibility.&lt;/p>
&lt;h2 id="reflecting-on-the-journey">Reflecting on the Journey&lt;/h2>
&lt;p>As we wrap up our journey of creating beginner-friendly learning materials for machine learning using reproducibility, it&amp;rsquo;s time to reflect on the rewarding experiences and valuable lessons learned along the way. Our deep dive into the world of machine learning and reproducibility not only enriched our knowledge but also provided us with an opportunity to contribute to the community at the &lt;strong>UC Open Source Symposium 2023&lt;/strong> at UCSC.&lt;/p>
&lt;p>The symposium was a memorable event where we presented our work in a poster session. The diversity of the audience, ranging from professors and researchers to students, added depth to our understanding through their valuable feedback and insights. It was intriguing to see the potential applications of our work in various contexts and its capacity to benefit the broader community.&lt;/p>
&lt;p>This project has been a personal journey of growth, teaching me much more than just machine learning and reproducibility. It honed my skills in collaboration, communication, and problem-solving. I learned to distill complex ideas into simple, accessible language and create engaging, interactive learning experiences. The most fulfilling part of this journey has been seeing our work come alive and realizing its potential to positively impact many people. The gratification that comes from creating something useful for others is unparalleled, and we are thrilled to share our materials with the world.&lt;/p>
&lt;p>Your time and interest in our work are greatly appreciated! Hope you enjoyed this blog!&lt;/p></description></item><item><title>GPU Emulator for Easy Reproducibility of DNN Training -- Final Blog Post</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20231006-haoranwu/</link><pubDate>Fri, 06 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20231006-haoranwu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>For the second half of the project, I spent some time reproducing the figures and then focused on hacking the source code of PyTorch to distinguish the Inter-GPU Computation (1GPU vs. 2GPUs).&lt;/p>
&lt;h4 id="summarization">Summarization&lt;/h4>
&lt;ul>
&lt;li>Finished reproducing Figures 3, 4, 5, and 6 from &lt;a href="https://ospo.ucsc.edu/project/osre23/utexas/gpuemulator" target="_blank" rel="noopener">GPU Emulator for Easy Reproducibility of DNN Training&lt;/a>.&lt;/li>
&lt;li>Explored inter-GPU computation in order to reproduce Figure 9.&lt;/li>
&lt;/ul>
&lt;h4 id="reporsitory-of-reproducing-figures">Reporsitory of Reproducing Figures&lt;/h4>
&lt;p>I have placed the repository of the deliverable here: &lt;a href="https://github.com/FarFlyField/OSRE_DELIVERABLE/tree/main" target="_blank" rel="noopener">https://github.com/FarFlyField/OSRE_DELIVERABLE/tree/main&lt;/a>
You can use the repository on any CPU-only machine, or you can rent a GPU from Chameleon and compare the real GPU&amp;rsquo;s results against the emulator&amp;rsquo;s.&lt;/p>
&lt;p>The repository explains how to set up the experiments and how to interpret the data they produce; you will need to understand the spreadsheet and some of the graphing files.&lt;/p>
&lt;h4 id="study-of-inter-gpu-computation">Study of Inter-GPU Computation&lt;/h4>
&lt;p>I have dissected the PyTorch source code to identify the computation-time differences between using 1 GPU and 2 GPUs (inter-GPU computation time). The most significant difference arises during the forward pass. Here are the main factors that lengthen computation time when using 2 GPUs:&lt;/p>
&lt;ul>
&lt;li>When using 1 GPU, PyTorch puts the training images onto the GPU in the main application once and for all. With 2 GPUs, however, PyTorch transfers the images before each forward pass using a parallel function. The function does two things:
&lt;ul>
&lt;li>It splits the images into sections and puts each section onto its respective GPU.&lt;/li>
&lt;li>It replicates the model onto each GPU and creates threads to train the replicas in parallel.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>These two steps account for much of the extra computation time when training on multiple GPUs, and they also make the measured transfer time for 2 GPUs appear smaller, because the image-transfer time is counted toward computation time.&lt;/li>
&lt;li>After the forward pass finishes, the parallel function gathers the outputs from the two GPUs onto the first GPU.&lt;/li>
&lt;/ul>
&lt;p>After gathering the outputs onto the first GPU, the code trains the next batch, repeating the steps of transferring the data, copying the model, running the parallel forward pass, and gathering the outputs once again.&lt;/p>
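&lt;p>The scatter/replicate/gather pattern described above can be sketched without GPUs. The following is an illustrative NumPy sketch of the pattern only, not PyTorch&amp;rsquo;s actual implementation (which uses device transfers and threads); the function names are hypothetical.&lt;/p>

```python
import numpy as np

def forward(weights, inputs):
    # Stand-in for a model's forward pass: a single linear layer.
    return inputs @ weights

def parallel_forward(weights, batch, n_devices=2):
    # 1) Scatter: split the batch into one chunk per device.
    chunks = np.array_split(batch, n_devices)
    # 2) Replicate: copy the model weights onto every device.
    replicas = [weights.copy() for _ in range(n_devices)]
    # 3) Run forward on each replica (threads in the real implementation).
    outputs = [forward(w, c) for w, c in zip(replicas, chunks)]
    # 4) Gather: concatenate all outputs back onto the "first device".
    return np.concatenate(outputs)

rng = np.random.default_rng(1)
weights = rng.normal(size=(8, 4))
batch = rng.normal(size=(32, 8))

single = forward(weights, batch)          # one-device forward
multi = parallel_forward(weights, batch)  # scatter/replicate/gather
print(np.allclose(single, multi))
```

&lt;p>The gathered result matches the single-device forward pass; the time difference in the real system comes from the extra scatter, replicate, and gather steps, not from a different numerical result.&lt;/p>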
&lt;p>The second significant difference, which I’m working on right now, appears when PyTorch runs the backward functions; these are broadly similar to the forward pass but differ in important ways. I have identified the loss.backward() call in our application code as the only contributor to the difference in computation time. Here are a few tasks I completed after locating it:&lt;/p>
&lt;ul>
&lt;li>Recorded the functions’ call stack when using 1 GPU and 2 GPUs.&lt;/li>
&lt;li>Recorded the time spent in the functions in the call stack of the functions.&lt;/li>
&lt;li>Identified inconsistencies in the measurements, then repeated and verified them until the results were consistent.&lt;/li>
&lt;/ul>
&lt;p>I have finished the basic measurements and drafted the call stack, but I haven’t yet pinpointed the exact differences. Because most of these functions are implemented in C++, printing their inputs for inspection will be somewhat harder, but doable.&lt;/p>
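&lt;p>The per-function timing described above can be approximated with a small wrapper that records how long each call takes. This is a hedged stand-in: it times an ordinary CPU function with time.perf_counter, whereas real GPU measurements would additionally need to synchronize the device before reading the clock. The function names are illustrative.&lt;/p>

```python
import time
from collections import defaultdict

durations = defaultdict(list)

def timed(fn):
    """Record the wall-clock time of each call, keyed by function name.
    (Real GPU timing would need a device synchronization first.)"""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations[fn.__name__].append(time.perf_counter() - start)
        return result
    return wrapper

@timed
def backward_step(n):
    # Stand-in for loss.backward(); burns a little CPU time.
    return sum(i * i for i in range(n))

# Repeat the measurement several times to check consistency,
# as described in the blog post.
for _ in range(5):
    backward_step(10_000)

mean = sum(durations["backward_step"]) / len(durations["backward_step"])
print(f"backward_step mean: {mean:.6f}s over {len(durations['backward_step'])} runs")
```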
&lt;p>The data recorded and analyzed are placed here:
&lt;a href="https://docs.google.com/spreadsheets/d/1vFj-UE3mjtsHIc5OesKX1sDvr6fpPwtPUMl0pM3V8SA/edit?usp=sharing" target="_blank" rel="noopener">https://docs.google.com/spreadsheets/d/1vFj-UE3mjtsHIc5OesKX1sDvr6fpPwtPUMl0pM3V8SA/edit?usp=sharing&lt;/a>
Summarized doc:
&lt;a href="https://docs.google.com/document/d/10XWNwCZ3kLzy4i6WgJ6KPsujEs2X1gzXblZtUoqMuJw/edit" target="_blank" rel="noopener">https://docs.google.com/document/d/10XWNwCZ3kLzy4i6WgJ6KPsujEs2X1gzXblZtUoqMuJw/edit&lt;/a>&lt;/p></description></item><item><title>Learning Machine Learning by Reproducing Vision Transformers</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231006-msaeed/</link><pubDate>Fri, 06 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231006-msaeed/</guid><description>&lt;p>Hello again!&lt;/p>
&lt;p>In this blog post, I will be discussing the second material I created for the 2023 Summer of Reproducibility Fellowship. As you may recall from my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230601-msaeed">first post&lt;/a>, I am working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> as my mentor. My goal is to create interactive open-source educational resources that teach reproducibility and reproducible research in machine learning (ML), as outlined in my &lt;a href="https://drive.google.com/file/d/13HnCMZawpabiLdBoOiaJFF2mNXIPLCVJ/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;p>In this post, I will share with you my second material, and how it can be helpful in machine learning class to teach students about vision transformers and reproducibility at the same time. If you haven&amp;rsquo;t seen my first work, be sure to check out my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230802-msaeed">previous blog post&lt;/a>. Without further ado, let&amp;rsquo;s dive in!&lt;/p>
&lt;h2 id="reproducing-an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale">Reproducing “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”&lt;/h2>
&lt;p>This material is a reproduction of Dosovitskiy et al.‘s 2020 paper, &lt;a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noopener">“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”&lt;/a>. This paper introduces the Vision Transformer (ViT), a novel architecture that applies the transformer model, originally designed for natural language processing tasks, to image recognition. The ViT model achieves state-of-the-art performance on several image classification benchmarks, demonstrating the potential of transformers for computer vision tasks.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231006-msaeed/ViT_hud9ed20979bb56dae4d8e9f4231875a17_383197_485ae1a0cccbdc73994be22901c125d5.webp 400w,
/report/osre23/nyu/eduml/20231006-msaeed/ViT_hud9ed20979bb56dae4d8e9f4231875a17_383197_f8af78acab4a91489ecff3308bc9c9c1.webp 760w,
/report/osre23/nyu/eduml/20231006-msaeed/ViT_hud9ed20979bb56dae4d8e9f4231875a17_383197_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231006-msaeed/ViT_hud9ed20979bb56dae4d8e9f4231875a17_383197_485ae1a0cccbdc73994be22901c125d5.webp"
width="760"
height="229"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
The figure illustrates the key idea behind ViT, which is to treat an image as a sequence of patches, similar to how a transformer treats a sentence as a sequence of words. Each patch is flattened into a vector and fed into the transformer encoder, which learns to capture the complex relationships between these patches. The resulting representation is then fed into an MLP head, which produces a final prediction for the image. This approach allows ViT to handle large input images and capture both global context and fine-grained details. ViT models can also be pre-trained on large datasets and fine-tuned on smaller datasets for improved performance.&lt;/p>
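&lt;p>The &amp;ldquo;image as a sequence of patches&amp;rdquo; step can be sketched in a few lines of NumPy. This is an illustrative reimplementation of the patching idea only, not the paper&amp;rsquo;s code: a 224x224 RGB image with 16x16 patches yields 14*14 = 196 tokens of dimension 16*16*3 = 768.&lt;/p>

```python
import numpy as np

def image_to_patches(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    mirroring ViT's "an image is a sequence of patches" idea."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    rows, cols = h // patch, w // patch
    # Reshape into (rows, patch, cols, patch, C), reorder so each patch
    # is contiguous, then flatten every patch into one vector.
    x = image.reshape(rows, patch, cols, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, C)
    return x.reshape(rows * cols, patch * patch * c)

# A 224x224 RGB image becomes a sequence of 196 tokens of dimension 768.
img = np.random.default_rng(0).normal(size=(224, 224, 3))
tokens = image_to_patches(img)
print(tokens.shape)  # (196, 768)
```

&lt;p>Each row of the result plays the role of a &amp;ldquo;word&amp;rdquo; fed to the transformer encoder, after a learned linear projection and position embeddings are added.&lt;/p>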
&lt;p>To reproduce this paper, I followed a systematic approach to ensure reliable results:&lt;/p>
&lt;ul>
&lt;li>Critically analyze the paper&amp;rsquo;s qualitative and quantitative claims.&lt;/li>
&lt;li>Identify the necessary experiments to verify each claim.&lt;/li>
&lt;li>Determine the required data, code, and hyperparameters for each experiment.&lt;/li>
&lt;li>Utilize pre-trained models for validating claims that require high computational resources.&lt;/li>
&lt;li>Investigate resources shared by the authors, such as code, data, and models.&lt;/li>
&lt;li>Assess the feasibility of verifying different types of claims.&lt;/li>
&lt;li>Design new experiments for validating qualitative claims when certain models or datasets are unavailable.&lt;/li>
&lt;/ul>
&lt;p>I utilized &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a> as my platform for conducting and documenting my reproduction experiments. Chameleon is a large-scale, reconfigurable experimental environment that supports computer science systems research. It enables users to create and share Jupyter notebooks capable of running Python code on Chameleon’s cloud servers. For this work, a GPU with 24 GB or more of memory is required to run the notebooks, and Chameleon offers several GPU types that meet this requirement.&lt;/p>
&lt;p>I have set up a &lt;a href="https://github.com/mohammed183/re_vit" target="_blank" rel="noopener">GitHub repository&lt;/a> where you can access all of my reproduction work. The repository contains interactive Jupyter notebooks that will help you learn more about machine learning and the reproducibility of machine learning research. These notebooks provide a hands-on approach to understanding the concepts and techniques presented in my reproduction work.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>Reproducing a paper can be a challenging task, and I encountered several obstacles during the process, including:&lt;/p>
&lt;ul>
&lt;li>The unavailability of pretraining datasets and pretrained models&lt;/li>
&lt;li>Inexact or unspecified hyperparameters&lt;/li>
&lt;li>The need for expensive resources for some hyperparameters&lt;/li>
&lt;li>The use of different frameworks for baseline CNNs and Vision Transformers&lt;/li>
&lt;/ul>
&lt;p>These issues posed significant difficulties in replicating the following table, a key result from the Vision Transformer paper that demonstrates its superiority over prior state-of-the-art models.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231006-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_7faf7451ff08ac87e7d12ab941c77f8e.webp 400w,
/report/osre23/nyu/eduml/20231006-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_5441e5e4c6ffed9b29244a3a3dcde852.webp 760w,
/report/osre23/nyu/eduml/20231006-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231006-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_7faf7451ff08ac87e7d12ab941c77f8e.webp"
width="760"
height="354"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>To overcome these challenges, I used the same models mentioned in the paper but pretrained on different datasets, experimented with various hyperparameter combinations to achieve the best results, and wrote my own code to ensure that both the baseline and Vision Transformer were fine-tuned using the same framework. I also faced other challenges, which I discussed in my notebooks along with the solutions I applied.&lt;/p>
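&lt;p>The hyperparameter exploration can be sketched as a simple grid search driven by one shared evaluation interface for both model families. Everything below is hypothetical: the scoring function merely stands in for an expensive fine-tune-and-evaluate run, and its shape (accuracy peaking near one learning rate) is made up for illustration. It only shows the workflow of trying combinations and keeping the best.&lt;/p>

```python
import itertools

def validation_accuracy(model, lr, weight_decay):
    """Hypothetical stand-in for fine-tuning `model` with these
    hyperparameters and evaluating it on a validation set; the shape
    of the curve (peak near lr=1e-3) is invented for illustration."""
    base = {"vit": 0.80, "resnet": 0.75}[model]
    return base - abs(lr - 1e-3) * 10 - weight_decay * 0.1

# Candidate hyperparameter values to try for every model.
grid = {"lr": [1e-4, 1e-3, 1e-2], "weight_decay": [0.0, 1e-4]}

best = {}
for model in ("vit", "resnet"):
    # The same loop (one shared "framework") evaluates both model families,
    # so the comparison between baseline and ViT stays fair.
    results = {
        combo: validation_accuracy(model, *combo)
        for combo in itertools.product(grid["lr"], grid["weight_decay"])
    }
    best[model] = max(results, key=results.get)

print(best)
```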
&lt;h2 id="how-to-use-this-material">How to use this material?&lt;/h2>
&lt;p>This material consists of a series of notebooks that guide you through the paper, its claims, experiments, and results. You will learn how to analyze, interpret, and validate the authors&amp;rsquo; claims. To get started, I recommend briefly skimming the &lt;a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noopener">original paper&lt;/a> to gain an understanding of the main ideas and public information. This will help you see how the authors could have been more transparent and clear in certain sections. The notebooks provide clear instructions and explanations, as well as details on how I addressed any missing components.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In this blog post, I&amp;rsquo;ve walked you through the contents of this material and the insights users can gain from it. This material is particularly intriguing as it replicates a paper that has significantly influenced the field of computer vision. The interactive nature of the material makes it not only educational but also engaging and enjoyable. I believe users will find this resource both fun and beneficial.&lt;/p>
&lt;p>I hope you found this post informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for reading and stay tuned for more updates!&lt;/p></description></item><item><title>Final GSoC Blog - Polyglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230925-kirandeol/</link><pubDate>Mon, 25 Sep 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230925-kirandeol/</guid><description>&lt;p>As I send in my final work submission for the final GSoC evaluation, I&amp;rsquo;m excited to share with you the progress we&amp;rsquo;ve made this summer (and future plans for Polyglot!). You can view the repository and web app here: &lt;a href="https://polyphyhub.github.io/PolyGlot/" target="_blank" rel="noopener">https://polyphyhub.github.io/PolyGlot/&lt;/a>. As a quick reminder of the project, we sought to extend the Polyglot web app, as developed by Hongwei (Henry) Zhou. For context, the web app follows this methodology:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Given a set of words, use an embedding model (such as Word2Vec, BERT, etc.) to generate a set of high dimensional points associated with each word.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use a dimensionality reduction method (such as UMAP) to reduce the dimensionality of each word-vector point to 3 dimensions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the novel MCPM (Monte Carlo Physarum Machine) to compute the similarities between a set of anchor points and the rest of the point cloud. You could use any similarity metric here, too, such as the Euclidean distance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The web app then displays the point cloud of 3-dimensional embeddings, but uses coloring to indicate the level of MCPM similarity each word has with the anchor point (e.g., if the anchor point is the word “dog”, the rest of the point cloud is colored such that words identified as similar to “dog” by the MCPM metric are brighter, whereas dissimilar words are darker).&lt;/p>
&lt;/li>
&lt;/ol>
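&lt;p>Steps 3 and 4 above can be sketched with Euclidean distance standing in for the MCPM metric, as the methodology explicitly allows. The words and 3-D coordinates below are made up for illustration; the brightness value models the anchor-similarity coloring the web app displays.&lt;/p>

```python
import numpy as np

def similarity_brightness(points, anchor_idx):
    """Color a 3-D point cloud by similarity to an anchor point.
    Euclidean distance stands in for the MCPM metric here;
    brighter (closer to 1.0) means more similar to the anchor."""
    anchor = points[anchor_idx]
    dist = np.linalg.norm(points - anchor, axis=1)
    # Map distance to [0, 1] brightness: the anchor itself is brightest.
    return 1.0 - dist / dist.max()

# Hypothetical 3-D embeddings of five words, with "dog" as the anchor.
words = ["dog", "puppy", "cat", "car", "train"]
points = np.array([[0.0, 0.0, 0.0],
                   [0.2, 0.1, 0.0],
                   [0.5, 0.4, 0.2],
                   [3.0, 2.5, 1.0],
                   [3.2, 2.8, 1.1]])

brightness = similarity_brightness(points, anchor_idx=words.index("dog"))
for word, b in zip(words, brightness):
    print(f"{word}: {b:.2f}")
```

&lt;p>Words near the anchor (&amp;ldquo;puppy&amp;rdquo;) come out bright; distant ones (&amp;ldquo;car&amp;rdquo;, &amp;ldquo;train&amp;rdquo;) come out dark, which is the visual effect described in step 4.&lt;/p>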
&lt;p>The main results since the last blog are summarized as follows:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Novel timeline feature in which users can track the importance of certain words over time by watching the change in size of points (computes the TF-IDF metric for a word across all documents in a given year). Uses linear interpolation for years which do not have an explicit importance score.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An industrial collaboration with UK startup Lautonomy, where we have pre-processed and entered their data into Polyglot. Pre-processing consisted of first computing a high dimensional embedding of their set of words using OpenAI&amp;rsquo;s CLIP model &lt;a href="https://openai.com/research/clip" target="_blank" rel="noopener">https://openai.com/research/clip&lt;/a> and the CLIP-as-service Python package &lt;a href="https://clip-as-service.jina.ai" target="_blank" rel="noopener">https://clip-as-service.jina.ai&lt;/a>. Next, we used UMAP to reduce the dimensionality of these embeddings to 3D. We computed the Euclidean distance on this data (in place of MCPM metric). Finally, we formatted the data to enter into Polyglot.&lt;/p>
&lt;/li>
&lt;/ol>
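&lt;p>The timeline feature&amp;rsquo;s interpolation step can be sketched with np.interp. The years and importance scores below are hypothetical; the point is only that years without an explicit TF-IDF score receive a linearly interpolated value between their nearest known neighbors.&lt;/p>

```python
import numpy as np

# Hypothetical TF-IDF importance of one word, known only for some years.
known_years = np.array([2010, 2014, 2020])
known_scores = np.array([0.10, 0.45, 0.30])

# Linearly interpolate a score for every year on the timeline, as the
# timeline feature does for years lacking an explicit importance score.
timeline = np.arange(2010, 2021)
scores = np.interp(timeline, known_years, known_scores)

for year, score in zip(timeline, scores):
    print(int(year), round(float(score), 3))
```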
&lt;p>Although the app has developed a lot over the summer, we are planning to continue working on Polyglot, particularly with respect to one of our original goals: to set up a pipeline from PolyPhy to Polyglot. Unfortunately, with PolyPhy undergoing refactoring this summer, we weren&amp;rsquo;t able to set this pipeline up. However, that is one of our goals for the next few months. We are also moving forward with the industrial collaboration with legal analytics startup Lautonomy. We hope to release an output together soon!&lt;/p>
&lt;p>If you&amp;rsquo;re curious about Polyglot or are interested in getting involved, please feel free to reach out to me, Oskar Elek, and Jasmine Otto!&lt;/p></description></item><item><title>noWorkflow as an experiment management tool - Final Report</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230914-jesselima/</link><pubDate>Thu, 14 Sep 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230914-jesselima/</guid><description>&lt;p>This post describes our midterm work status and some achievements we
have made so far in our project
&lt;a href="https://docs.google.com/document/d/1YMtPjZXcgt5eplyxIgQE8IBpQIiRlB9eqVSQiIPhXNU/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>
for
&lt;a href="https://ospo.ucsc.edu/project/osre23/nyu/noworkflow" target="_blank" rel="noopener">noWorkflow&lt;/a>.&lt;/p>
&lt;p>For a more friendly introduction to our work, please, refer to this
&lt;a href="https://github.com/jaglima/noworkflow_usecase/blob/main/README.md" target="_blank" rel="noopener">tutorial
available&lt;/a>.&lt;/p>
&lt;p>Our final code to merge is available in &lt;a href="https://github.com/jaglima/noworkflow/tree/sor_features" target="_blank" rel="noopener">this repository&lt;/a>.&lt;/p>
&lt;h2 id="different-ways-of-managing-experiments">Different ways of managing experiments&lt;/h2>
&lt;p>From our starting point at the midterm, and from our initial aspirations
for the SoR, we kept on track with the goal of adding features to
noWorkflow related to managing DS/ML experimental setups focusing on
reproducibility.&lt;/p>
&lt;p>With the emergence of AI across multiple fields in industry and
academia, the subject of reproducibility has become increasingly
relevant. In [1] we have an
interesting description of the sources of irreproducibility in Machine
Learning. All these sources are present at different stages of a
project's experimental phases and may even persist in production
environments, leading to the accumulation of technical debt
[2]. The problem of
irreproducibility is also discussed in [[3],
[4]], which point out that
delivery speed usually comes at the expense of reproducibility,
among other things.&lt;/p>
&lt;p>The CRISP-DM process as reviewed in
[5] demonstrates that Data
Science experiments follow a typical path of execution. In the same
manner, [[3], [6],
[7]] point out that
Machine Learning pipelines are composed of well-defined layers (or
stages) throughout their lifecycle. The emergence of AI in real-world
applications has stressed the almost artisanal ways of creating and managing
analytical experiments and reinforced that there is room to make things
more efficient.&lt;/p>
&lt;p>In the search for possible approaches to the problem, we came across
several projects that aimed to address these issues. Not surprisingly,
multiple authors pursued the same goal, for instance [[9],
[10]]. In these references,
and confirmed in our survey, we found everything from solutions targeted
at specific modeling steps to services aiming for end-to-end AIOps
management. Some are available as software packages, others as SaaS in
cloud environments. In general terms, all of them end up offering
features in different layers of the workflow (i.e. data, feature,
scoring, and evaluation) or with different conceptualizations of
reproducibility/replicability/repeatability, as noticed by
[11]. On one hand, this lack of
standards makes any assessment difficult. On the other hand, it suggests
a community still in an exploratory phase around a hot topic.&lt;/p>
&lt;p>Specifically for this project, our focus is on the initial stages of
computational scientific experiments. As studied in [8], in this
phase experiments are i) implemented by people as prototypes, ii) built
with little focus on pipeline design, and iii) developed in tools like
notebooks, which mix documentation, visualization, and code with no
required sequential structure. These three practices impact
reproducibility and efficiency and are prone to creating technical debt.
However, tools like noWorkflow show huge potential in such scenarios.
noWorkflow is promising because it i) demands minimal setup to be
functional, ii) works well with almost nonexistent workflows,
iii) requires minimal intrusive code alongside the experimental code,
and iv) integrates well with notebooks, which are the typical artifact
in these experiments.&lt;/p>
&lt;p>According to its core team, the primary goal of noWorkflow is to
&amp;quot;...allow scientists to benefit from provenance data analysis even
when they don't use a workflow system.&amp;quot;. Unlike other tools,
&amp;quot;noWorkflow captures provenance from Python scripts without needing a
version control system or any other environment&amp;quot;. It is particularly
interesting when we are in the scenario described above, where we lack
any structured system at the beginning of experiments. In fact, after
going through the docs, we can verify that noWorkflow provides:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Command-line accessibility&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Seamless integration with Jupyter Notebooks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Minimal setup requirements in your environment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Elimination of the need for virtual machines or containers in its
setup&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Workflow-free operation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Open source license&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Framework-agnostic position&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Finally, our research confirmed that there is an open spot in the
management of scientific experiments that reproducibility needs to
occupy. Provenance tools can help academia and industry groups
toward this goal, and this summer we focused on adding relevant
features to push noWorkflow in this direction.&lt;/p>
&lt;h2 id="different-tools-for-different-needs">Different tools for different needs&lt;/h2>
&lt;p>In our research phase, we didn't find any taxonomy that fully
accommodated our review of the different categories of tools providing
reproducibility and experimental management. So, we describe some
tools in the following categories (freely adapted from these online
references
&lt;a href="https://ml-ops.org/content/mlops-principles" target="_blank" rel="noopener">[here]&lt;/a> and
&lt;a href="https://ambiata.com/blog/2020-12-07-mlops-tools/" target="_blank" rel="noopener">[here]&lt;/a>):&lt;/p>
&lt;p>&lt;strong>Data and Pipeline Versioning&lt;/strong>: Platforms dealing with the ingestion,
processing, and exposure of features for model training and inference.
They enable collaboration and discoverability of already existing
Feature Sets throughout teams and organizations, and provide provenance
and lineage for data at different levels of complexity.&lt;/p>
&lt;p>&lt;strong>Metadata Stores/Experiment Trackers&lt;/strong>: These are specifically built to
store metadata about ML experiments and expose it to stakeholders. They
help with debugging, comparing, and collaborating on experiments. It is
possible to divide them into Experiment Trackers and Model Registries.
Moreover, there are projects offering reproducibility features like
hyperparameter search, experiment versioning, etc. However, they demand
more robust workflows and are better suited for projects in the
production/monitoring phases.&lt;/p>
&lt;p>&lt;strong>Pipeline frameworks&lt;/strong>: These operate within the realm of production,
similar to Data Engineering workflows. Their usual goal is to allow any
ML/AI product to be served across a wide range of architectures, and to
integrate all the low-hanging fruit along the way: for instance,
pipelines adding hyperparameter optimization tasks, experiment tracking
integrations, boilerplate containerized deployment, etc.&lt;/p>
&lt;p>&lt;strong>Deployment and Observability&lt;/strong>: These focus on deploying models for
real-time inference and monitoring model quality once they are deployed
in production. Their aim is to facilitate post-deployment control tasks
such as monitoring feature drift, conducting A/B testing, facilitating
fast model shifts, and more.&lt;/p>
&lt;p>The most remarkable aspect of this survey is that there are different
tools for different phases in the life cycle of AI products. Tools
like DVC and Pachyderm are Metadata Stores, allowing
Experiment Tracking with variable-tagging features as well as Data
and Pipeline tracking. They are the tools most similar to noWorkflow in
functionality. However, DVC has a more complex framework for
dealing with different 'types' of tags, and relies on command-line
tools to extract and analyze tagged variables. It also depends strongly
on git and replicates git's logic. Pachyderm requires a more
sophisticated setup at the start, relying on containers and a server. This
is an obstacle for small and lean prototypes, requiring the installation
of a Docker image and all the friction of managing it.&lt;/p>
&lt;p>There are other tools, like MLflow and Neptune, that position themselves as
Model Experiment Versioning with Monitoring and Deployment features.
They also have elements of pipeline frameworks, offering full
integration and boilerplate for seamless integration with cloud
platforms.&lt;/p>
&lt;p>Pipelines are a vast field. Examples include AWS SageMaker, Google Vertex,
DataRobot, and Weights &amp;amp; Biases, among others. All of them offer features
in all categories, with a strong focus on exploring every
automation that can be offered to the final user: automatic
parameter tuning, model selection, retraining, data lineage, metadata
storing, etc.&lt;/p>
&lt;p>Finally, Deployment and Observability frameworks are in the deployment
realm, which is another stage far removed from the prototypical phases of
experiments. They come into the scene when all experimental and
inferential processes are done, and there is an AI artifact that needs
to be deployed and monitored. Tools like Seldon, H2O, and DataRobot do
this job, again with some features for hyperparameter tuning, pipeline
frameworks, and data and pipeline tracking.&lt;/p>
&lt;p>In light of this, when considering the management and operation of
experiments, we have a reduced sample of alternatives. Among them,
Notebook integration/management is rare. Some rely on other
tools like Git or impose an overhead in coding/setup with reserved
keywords, tags, and managerial workflows that hinder the process.&lt;/p>
&lt;p>At first sight, our &amp;quot;informal&amp;quot; taxonomy positions noWorkflow as a
Data/Pipeline Versioning and Metadata Store/Experiment Tracker tool. It is
not a Pipeline Framework, which works like a building block facilitating
the integration of artifacts at production stages. It is not a
Deployment and Observability framework, because those live in the
post-deployment realm, which is another stage far removed from the
prototypical phases of experiments.&lt;/p>
&lt;h2 id="desiderata">Desiderata&lt;/h2>
&lt;p>As mentioned earlier, a typical workflow in DS/ML projects is well
described by CRISP-DM [5]
and precedes the deployment and production phases in the whole lifecycle
of DS/ML projects.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image1.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Fig 1: CRISP-DM example of trajectory through a data science project&lt;/p>
&lt;p>Briefly speaking, a workflow starts when a user creates a Jupyter
Notebook and starts writing code. Usually, they import or select
data from a source, explore the features expected to have the
highest inference potential, tune some parameters to set up
training, then train the model and evaluate its predictive power through
different metrics. At this final step, we have delineated a trial. This
trial's result can suggest further improvements and new hypotheses about
data, features, model types, and hyperparameters. Then, we have a new
experiment in mind that will result in a new trial.&lt;/p>
&lt;p>When this process repeats multiple times, a researcher may end up with
different notebooks, each storing a different experiment. Each
notebook has multiple hyperparameters, modeling choices, and modeling
hypotheses. Alternatively, the experimenter may have a single notebook where
different experiments were executed in a nonlinear order across the
cells. This latter case is pointed out in
[8], where Notebook flexibility
makes it difficult to understand which execution order resulted in a
specific output.&lt;/p>
&lt;p>Ideally, any researcher or team would benefit most if
they could:&lt;/p>
&lt;p>a) In a running Notebook, retrieve all the operations
that contributed to the result of a variable of interest. In this
case, modifications applied to the inputs or to the order of
operations would be easily detectable, as would any
nonlinear execution that interferes with a control result.&lt;/p>
&lt;p>b) Compare trials across different experiments. After experimenting with
different hypotheses about hyperparameters, features, or operation
order, the user should be able to easily compare the history of two trials
and spot differences.&lt;/p>
&lt;p>c) Retrieve a target variable across the different trials executed
in the context of an experiment. After carrying out multiple
experimental trials, users should be able to compare results
that are stored in different Notebooks (or even elsewhere).&lt;/p>
&lt;p>d) Be as &amp;quot;no workflow&amp;quot; as possible. All the former requisites
should be possible with minimal code intervention, tags, reserved
words, or any other active coding effort.&lt;/p>
&lt;p>With these goals in mind, we worked on our deliverables and used the
experiment carried out by [12]
as a guideline to validate the new noWorkflow features.&lt;/p>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;p>In this section, we describe what we implemented during this
summer.&lt;/p>
&lt;p>We started with tagging cells and variables and then navigating through
their dependencies, i.e., all the other variables and function calls that
contributed to their final values. This was a fundamental step that allowed
us to evolve toward features that are really useful in day-to-day
practice.&lt;/p>
&lt;p>From the features of tagging a cell and tagging a variable, we evolved
to the following features (an interactive notebook is available here):&lt;/p>
&lt;ul>
&lt;li>&lt;em>backwards_deps('var_name', granularity_level)&lt;/em> : returns a
dictionary storing operations/function calls and their associated
values that contributed to the final value of the tagged variable.
granularity_level controls whether the internal operations of
functions are included or not.&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image5.png" alt="backwards_deps" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/blockquote>
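&lt;p>To make the idea concrete, here is a toy sketch of how one might walk such a dependency structure. This is purely illustrative: the dictionary layout and the &lt;em>backwards_deps_sketch&lt;/em> helper are invented for this example and are not noWorkflow's actual data format or API.&lt;/p>

```python
# Toy sketch of a backwards-dependency lookup. Illustrative only:
# noWorkflow's backwards_deps returns its own (richer) structure.
def backwards_deps_sketch(var, graph):
    """Collect every operation that contributed to `var`'s final value.

    `graph` maps a variable name to (operation_string, input_variables).
    """
    ops = {}

    def visit(name):
        if name not in graph or name in ops:
            return
        operation, inputs = graph[name]
        ops[name] = operation
        for dep in inputs:
            visit(dep)

    visit(var)
    return ops

# A made-up mini-experiment: score depends on model and X; model on X and y.
graph = {
    "score": ("evaluate(model, X)", ["model", "X"]),
    "model": ("fit(X, y)", ["X", "y"]),
    "X": ("load('features.csv')", []),
    "y": ("load('labels.csv')", []),
}
print(backwards_deps_sketch("score", graph))
```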
&lt;ul>
&lt;li>
&lt;p>&lt;em>global_backwards_deps&lt;/em>('var_name', granularity_level) : does the
same as backwards_deps, but over all tagging and
re-tagging events in the notebook. It allows retrieval of the
complete operation history of a tagged variable across all executed cells in
the notebook.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>store_operations(trial_id, dictionary_ops)&lt;/em> : saves the current
trial in order to make further comparisons with other experiments.
The dictionaries aren't stored in &lt;em>.noworkflow/db.sqlite&lt;/em>, but
in a shelve object named &lt;em>ops.db&lt;/em> in the notebook's local
folder.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>resume_trials()&lt;/em> : to support the management of experiments, the
user can see the trial_ids of all experiments stored in ops.db that are
available for comparison/analysis.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>trial_intersection_diff(trial_id1, trial_id2)&lt;/em> : all mutual
variables/function_calls between two experiments have their scalar
values compared.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image2.png" alt="trial_intersection_diff" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/blockquote>
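&lt;p>As a rough illustration of this storage scheme, the snippet below persists trial dictionaries in a shelve file using only the standard library. The function names mirror the features above, but noWorkflow's actual implementation and schema may differ.&lt;/p>

```python
import os
import shelve
import tempfile

# Minimal sketch of persisting trial dictionaries in a shelve file,
# mimicking the ops.db idea described above (illustrative only;
# noWorkflow's actual schema may differ).
def store_operations(db_path, trial_id, dictionary_ops):
    with shelve.open(db_path) as db:
        db[str(trial_id)] = dictionary_ops

def resume_trials(db_path):
    with shelve.open(db_path) as db:
        return sorted(db.keys())

# Usage: store two trials, then list the ids available for comparison.
db_path = os.path.join(tempfile.mkdtemp(), "ops")
store_operations(db_path, 1, {"accuracy": 0.91, "lr": 0.10})
store_operations(db_path, 2, {"accuracy": 0.93, "lr": 0.01})
print(resume_trials(db_path))
```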
&lt;ul>
&lt;li>&lt;em>trial_diff(trial_id1, trial_id2)&lt;/em> : the values of variables and
function calls are exhibited in a diff file format, emphasizing the
order of operations. The goal here is to show whether the order of
operations differed between the two experiments. Again, only
scalar values are exhibited; more complex data structures (matrices,
vectors, tensors, etc.) are only signaled as &lt;em>'complex_type'&lt;/em>&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image3.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/blockquote>
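&lt;p>The diff idea can be sketched with the standard library's difflib, collapsing non-scalar values to 'complex_type' as described above. This is an illustrative approximation, not noWorkflow's actual output format.&lt;/p>

```python
import difflib

# Sketch of diffing two trials' operation sequences, collapsing
# non-scalar values to 'complex_type' (illustrative; the real
# trial_diff output format may differ).
def render(ops):
    lines = []
    for name, value in ops:
        shown = value if isinstance(value, (int, float, str, bool)) else "complex_type"
        lines.append(f"{name} = {shown}")
    return lines

# Made-up trials: same operations, different order and values.
trial_1 = [("X", [1, 2, 3]), ("lr", 0.1), ("accuracy", 0.91)]
trial_2 = [("lr", 0.01), ("X", [1, 2, 3]), ("accuracy", 0.93)]

diff = list(difflib.unified_diff(render(trial_1), render(trial_2),
                                 "trial_1", "trial_2", lineterm=""))
print("\n".join(diff))
```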
&lt;ul>
&lt;li>&lt;em>var_tag_plot('var_name')&lt;/em> : charts the evolution of a given
variable across multiple trials in the database. In this case, all
experiments stored in ops.db and tagged as &lt;em>target_var&lt;/em> have their
values plotted.&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image4.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>&lt;em>var_tag_values('var_name') :&lt;/em> provides access to a pandas DataFrame
of var_name entries with corresponding values across different trials.&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image6.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
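&lt;p>Conceptually, this amounts to collecting one row per trial for the tagged variable. The sketch below assembles such a table with plain dictionaries instead of the pandas DataFrame that noWorkflow provides; the trial contents are made up for illustration.&lt;/p>

```python
# Sketch of retrieving a tagged variable across stored trials as rows,
# instead of the pandas DataFrame noWorkflow provides (trial contents
# here are made up for illustration).
trials = {
    "1": {"accuracy": 0.91, "lr": 0.10},
    "2": {"accuracy": 0.93, "lr": 0.01},
}

def var_tag_values_sketch(var_name, trials):
    return [
        {"trial_id": tid, var_name: ops[var_name]}
        for tid, ops in sorted(trials.items())
        if var_name in ops
    ]

rows = var_tag_values_sketch("accuracy", trials)
print(rows)
```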
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>As expected, we had unexpected findings along the project. Below, we
delve into the most significant challenges we had to face:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Jupyter notebooks allow nonlinear execution of small parts of code
through cells. More than once, we had to align on how to design
functionality to handle unexpected scenarios. One example was the
backwards_deps() and global_backwards_deps() functions: the latter
was born to cover the case where the user wants all dependencies
rather than only the local cell dependencies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Despite the high quality of the current version of the package, the
project lacks documentation, which slows down the analysis of any
new development. In this project, the aid of mentors was crucial at
points where deeper knowledge was needed.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>What is the vocation of noWorkflow? At some points in the project,
we had to discuss forcing some kind of workflow on the user, which
would go against the philosophy of the project.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When working on comparing results, especially in DS/ML fields,
complex types arise. Numerical vectors, matrices, and tensors from
NumPy and other frameworks, as well as data frames, can't be
properly manipulated based on our current approach.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The dilemma of focusing on graphic visual features versus more
sophisticated APIs. More than once, we needed to choose between
making a visual add-on to Jupyter or implementing a more complete
API.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The current version of Jupyter support in noWorkflow doesn&amp;rsquo;t
integrate well with JupyterLab. Also, IPython itself keeps releasing
new versions, and noWorkflow needs to adapt to them.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="future-improvements">Future Improvements&lt;/h2>
&lt;p>Given our current achievements and the insights gained throughout the
project, we would highlight the following points as crucial
roadmap improvements:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Add complex-type treatment for comparisons. Today, visualizing and
navigating through matrices, data frames, and tensors isn't possible
with noWorkflow, although users can do so by their own means.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Migrate the dictionaries storing sequences of operations from
shelve objects to a more efficient storage and retrieval mechanism.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Make it easier for users to manage (store, retrieve, and navigate)
through different trials.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Add graphical management instead of relying upon API calls only.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Evolve the feature of tagging cells.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When tagging a model, save its binary representation to be recovered
in the future.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Add the capability of tracking local dataset reads.
Currently, it is possible to track changes in the name/path of the
dataset; however, modifications to the contents of a dataset are
not traceable.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="what-ive-learned">What I've learned&lt;/h2>
&lt;p>This was a great summer with two personal discoveries. The first was
my first formal contact with the subject of Reproducibility. The second was
fully contributing to an Open Source project. In the research phase,
I got in touch with the state of the art of reproducibility
research and some of its nuances. In the Open Source contributing
experience, I was mentored by the core team of noWorkflow and
exercised all the skills required to build high-quality software.&lt;/p>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>I would like to thank the Summer of Reproducibility organization for
providing this wonderful opportunity for interested people to engage with
Open Source software. Also, thanks to the core team of noWorkflow for
supporting me in doing this work.&lt;/p>
&lt;h2 id="bibliography">Bibliography&lt;/h2>
&lt;p>[1] [O. E. Gundersen, K. Coakley, C. Kirkpatrick, and Y. Gil, &amp;ldquo;Sources
of irreproducibility in machine learning: A review,&amp;rdquo; &lt;em>arXiv preprint
arXiv:2204.07610&lt;/em>.]&lt;/p>
&lt;p>[2] [D. Sculley &lt;em>et al.&lt;/em>, &amp;ldquo;Machine Learning: The High Interest Credit
Card of Technical Debt,&amp;rdquo; in &lt;em>SE4ML: Software Engineering for Machine
Learning (NIPS 2014 Workshop)&lt;/em>,
2014.]&lt;/p>
&lt;p>[3] [P. Sugimura and F. Hartl, &amp;ldquo;Building a reproducible machine
learning pipeline,&amp;rdquo; &lt;em>arXiv preprint arXiv:1810.04570&lt;/em>,
2018.]&lt;/p>
&lt;p>[4] [D. Sculley &lt;em>et al.&lt;/em>, &amp;ldquo;Hidden technical debt in machine learning
systems,&amp;rdquo; &lt;em>Adv. Neural Inf. Process. Syst.&lt;/em>, vol. 28,
2015.]&lt;/p>
&lt;p>[5] [F. Martínez-Plumed &lt;em>et al.&lt;/em>, &amp;ldquo;CRISP-DM twenty years later: From
data mining processes to data science trajectories,&amp;rdquo; &lt;em>IEEE Trans. Knowl.
Data Eng.&lt;/em>, vol. 33, no. 8, pp. 3048&amp;ndash;3061,
2019.]&lt;/p>
&lt;p>[6] [N. A. Lynnerup, L. Nolling, R. Hasle, and J. Hallam, &amp;ldquo;A Survey on
Reproducibility by Evaluating Deep Reinforcement Learning Algorithms on
Real-World Robots,&amp;rdquo; in &lt;em>Proceedings of the Conference on Robot
Learning&lt;/em>, L. P. Kaelbling, D. Kragic, and K. Sugiura, Eds., in
Proceedings of Machine Learning Research, vol. 100. PMLR, 30 Oct--01
Nov 2020, pp. 466&amp;ndash;489.]&lt;/p>
&lt;p>[7] [A. Masood, A. Hashmi, A. Masood, and A. Hashmi, &amp;ldquo;AIOps:
predictive analytics &amp;amp; machine learning in operations,&amp;rdquo; &lt;em>Cognitive
Computing Recipes: Artificial Intelligence Solutions Using Microsoft
Cognitive Services and TensorFlow&lt;/em>, pp. 359&amp;ndash;382,
2019.]&lt;/p>
&lt;p>[8] [J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire,
&amp;ldquo;Understanding and improving the quality and reproducibility of Jupyter
notebooks,&amp;rdquo; &lt;em>Empirical Software Engineering&lt;/em>, vol. 26, no. 4, p. 65,
2021.]&lt;/p>
&lt;p>[9] [D. Kreuzberger, N. Kühl, and S. Hirschl, &amp;ldquo;Machine Learning
Operations (MLOps): Overview, Definition, and Architecture,&amp;rdquo; &lt;em>IEEE
Access&lt;/em>, vol. 11, pp. 31866&amp;ndash;31879,
2023.]&lt;/p>
&lt;p>[10] [N. Hewage and D. Meedeniya, &amp;ldquo;Machine learning operations: A
survey on MLOps tool support,&amp;rdquo; &lt;em>arXiv preprint arXiv:2202.10169&lt;/em>,
2022.]&lt;/p>
&lt;p>[11] [H. E. Plesser, &amp;ldquo;Reproducibility vs. replicability: a brief
history of a confused terminology,&amp;rdquo; &lt;em>Front. Neuroinform.&lt;/em>, vol. 11, p.
76, 2018.]&lt;/p>
&lt;p>[12] [Z. Salekshahrezaee, J. L. Leevy, and T. M. Khoshgoftaar, &amp;ldquo;The
effect of feature extraction and data sampling on credit card fraud
detection,&amp;rdquo; &lt;em>Journal of Big Data&lt;/em>, vol. 10, no. 1, pp. 1&amp;ndash;17,
2023.]&lt;/p></description></item><item><title>KV store final Blog</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230825-manank/</link><pubDate>Fri, 25 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230825-manank/</guid><description>&lt;p>Hello again!
Before we get started, take a look at my previous blogs: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank">Introduction&lt;/a> and
&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank">Mid Term&lt;/a>. The goal of the project was to implement an io_uring-based backend driver for the client side, which at
that time used traditional sockets. The objective was to improve performance through the zero-copy capabilities of io_uring. In the process, I learned many things
about &lt;a href="https://gitlab.com/kinetic-storage/libkinetic/-/tree/develop" target="_blank" rel="noopener">libkinetic&lt;/a> and KV stores in general.&lt;/p>
&lt;p>I started by writing a separate driver using io_uring in libkinetic/src, in ktli_uring.c, most of which is similar to the sockets backend in ktli_sockets.c. The only
difference is in the send and receive functions. For a more detailed description of the implementation, refer to the midterm blog.&lt;/p>
&lt;p>After the implementation, it was time to put it to the test. We ran extensive benchmarks with a tool called &lt;a href="https://fio.readthedocs.io/en/latest/fio_doc.html" target="_blank" rel="noopener">fio&lt;/a>, which
is generally used to test filesystems and other IO-related workloads. Thanks to Philip, who had already written an IO engine for testing the kinetic KV store (&lt;a href="https://github.com/pkufeldt/fio" target="_blank" rel="noopener">link&lt;/a>), I didn&amp;rsquo;t have much trouble setting up the testbench. Philip also set up an Ubuntu server with the kinetic server
and gave me access through ssh. We ran extensive tests on that server, with both the socket and uring backends, across several different block sizes. The benchmark spreadsheet can be found &lt;a href="https://docs.google.com/spreadsheets/d/1HE7-KbxSqYZ3vmTZiJYoq21P7zfymU7N/edit?usp=sharing&amp;amp;ouid=116274960434137108384&amp;amp;rtpof=true&amp;amp;sd=true" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>We spent a lot of time reading and discussing the numbers, probably the most time-consuming part of the project. We had several long discussions analyzing the numbers
and their implications. For example, in the initial tests we were getting a very high standard deviation in mean send times; we then figured out it was caused by a network
bottleneck, as we were using large block sizes and quickly filling up the 2.5G network bandwidth.&lt;/p>
&lt;p>In conclusion, we found that there are many other major factors affecting the performance of the KV store, for example the network and the server side of the KV
store. Thus, although io_uring offers a performance benefit at the userspace-kernel boundary, in this case other factors had a more significant effect than the
kernel IO stack on the client side. Therefore, to increase performance, we need to look at the server side.&lt;/p>
&lt;p>I would like to thank Philip and Aldrin for their unwavering support and in-depth discussions on the topic in our weekly meetings. I learned a lot from them
throughout the entire duration of the project.&lt;/p>
&lt;p>The pyrope hardware description language now has syntax highlighting available for neovim users.
The &lt;a href="https://github.com/masc-ucsc/tree-sitter-pyrope" target="_blank" rel="noopener">repository&lt;/a> includes a guide to installing the parser, and activating highlights.
After we have tested the syntax highlighting, a pull request will be made to the &lt;a href="https://github.com/nvim-treesitter/nvim-treesitter" target="_blank" rel="noopener">nvim-treesitter repository&lt;/a>.
In this post, I will outline the highlighting process and reflect on a useful feature of neovim.&lt;/p>
&lt;h3 id="syntax-trees">Syntax Trees&lt;/h3>
&lt;p>The pyrope language is described by a grammar. A grammar is a set of rules that describes the allowed structure of a language.
A parser uses the grammar to generate a syntax tree. For example, consider this line of pyrope code.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">var a:u32 &lt;span class="o">=&lt;/span> &lt;span class="m">0&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Using the pyrope parser, we can get a syntax tree for this statement.
The command &lt;code>tree-sitter parse file.prp&lt;/code> gives us the following output.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="o">(&lt;/span>statement &lt;span class="o">[&lt;/span>1, 0&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 13&lt;span class="o">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">(&lt;/span>assignment_or_declaration_statement &lt;span class="o">[&lt;/span>1, 0&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 13&lt;span class="o">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> decl: &lt;span class="o">(&lt;/span>var_or_let_or_reg &lt;span class="o">[&lt;/span>1, 0&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 3&lt;span class="o">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> lvalue: &lt;span class="o">(&lt;/span>complex_identifier &lt;span class="o">[&lt;/span>1, 4&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 5&lt;span class="o">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">(&lt;/span>identifier &lt;span class="o">[&lt;/span>1, 4&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 5&lt;span class="o">]))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> type: &lt;span class="o">(&lt;/span>type_cast &lt;span class="o">[&lt;/span>1, 5&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 9&lt;span class="o">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> type: &lt;span class="o">(&lt;/span>primitive_type &lt;span class="o">[&lt;/span>1, 6&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 9&lt;span class="o">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">(&lt;/span>sized_integer_type &lt;span class="o">[&lt;/span>1, 6&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 9&lt;span class="o">])))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> operator: &lt;span class="o">(&lt;/span>assignment_operator &lt;span class="o">[&lt;/span>1, 10&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 11&lt;span class="o">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> rvalue: &lt;span class="o">(&lt;/span>constant &lt;span class="o">[&lt;/span>1, 12&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 13&lt;span class="o">])))&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The nvim-treesitter syntax highlighting is based on this tree structure.&lt;/p>
&lt;h3 id="queries">Queries&lt;/h3>
&lt;p>A query is an expression that selects nodes from the tree.
For example,&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="o">(&lt;/span>complex_identifier &lt;span class="o">(&lt;/span>identifier&lt;span class="o">))&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>matches any identifier that is the child of a complex_identifier.
Color schemes in neovim assign colors to different highlight groups.
So, we can assign highlight groups to tree queries.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="o">(&lt;/span>constant&lt;span class="o">)&lt;/span> @number
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now, when a constant shows up in the syntax tree, it will highlight according to the @number group.
Most of the work I did on this project involved studying the pyrope grammar, and writing queries based on it.&lt;/p>
&lt;h2 id="neovim">neovim&lt;/h2>
&lt;p>The text editor &lt;a href="https://neovim.io/" target="_blank" rel="noopener">neovim&lt;/a> is a popular choice among programmers. It allows advanced user control with configuration files.
It also has an active community working on plugins to extend its functionality.
Tools such as lazyvim allow for features like code completion and file management that give neovim the same functionality as IDEs.
However, because neovim configuration is unique to each user, it can be difficult to write reproducible neovim instructions.
For example, Professor Renau was going to test pyrope syntax highlighting in neovim.
However, I did not know what configuration was necessary for him to see highlights in neovim.
While I knew that syntax highlighting worked on my setup, I have lots of configuration files that may have contributed to that success.
There is no guarantee that Professor Renau, or other potential users, have the same neovim configuration that I do.&lt;/p>
&lt;h3 id="nvim_appname">NVIM_APPNAME&lt;/h3>
&lt;p>So, Professor Renau suggested I use the &lt;code>$NVIM_APPNAME&lt;/code> variable to test the process on a fresh configuration.
This feature allows the user to specify the configuration files used to launch neovim.
For example, I installed &lt;a href="https://www.lazyvim.org/" target="_blank" rel="noopener">lazyvim&lt;/a> to the folder &lt;code>~/.config/lazy&lt;/code>. Then, I launched neovim with &lt;code>NVIM_APPNAME=lazy nvim&lt;/code>.
So instead of using my default configuration from &lt;code>~/.config/nvim&lt;/code>, the lazyvim configuration was used.
This allowed me to use a neovim instance that was unaffected by my configuration files.
I was able to preview the process of setting up syntax highlighting from the perspective of a lazyvim user.
Similarly, the process can be done with an empty folder to mimic a brand new neovim installation.
The point is, configuration files can impact reproducibility in neovim.
However, this feature allows us to bypass our individual configurations and create reproducible guidelines.&lt;/p>
&lt;h3 id="conclusion">Conclusion&lt;/h3>
&lt;p>In conclusion, most of my work involved writing queries for the pyrope tree-sitter grammar.
This was for the purpose of syntax highlighting in neovim.
However, an important part of any open source project is communicating the results and providing documentation.
The NVIM_APPNAME feature helps view neovim from the perspective of different users, which helps for writing useful documentation.&lt;/p></description></item><item><title>Reproducible Evaluation of Multi-level Erasure Coding (Midterm)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ornl/multilevelerasure/20230801-zhiyanw/</link><pubDate>Sat, 05 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ornl/multilevelerasure/20230801-zhiyanw/</guid><description>&lt;p>Hi Everyone,&lt;/p>
&lt;p>I hope everything goes well! This is my second blog post for my project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/MultiLevelErasure">Reproducible Evaluation of Multi-level Erasure Coding&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a>, and Meng Wang. In summary, my project aims to build a platform to reproducibly evaluate the performance and durability of MLEC (Multi-Level Erasure Coding) for large-scale storage systems under different design configurations. The details are in this &lt;a href="https://docs.google.com/document/d/1dO1aING1QcSB---XklzUjNz0usVh7qWffVGC3GZq2AE/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;p>In the course of these few weeks, I&amp;rsquo;ve completed several tasks to achieve the aim of this project, including&lt;/p>
&lt;ul>
&lt;li>Literature Review&lt;/li>
&lt;li>Studying the Erasure Coding Simulator and Creating Reproducible Evaluations, with the following policies
&lt;ul>
&lt;li>Clustered/Declustered Local-level SLEC&lt;/li>
&lt;li>Clustered/Declustered Network-level SLEC&lt;/li>
&lt;li>MLEC with C/C, C/D, D/C, D/D configuration&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="literature-review">Literature Review&lt;/h2>
&lt;p>Prior to developing the simulator, my first step was to delve into literature on distinct erasure coding policies. To understand a simulator for a complex erasure coding policy such as MLEC, I wanted to start from simpler EC policies and then extend my knowledge to more complex ones. Moreover, I also aimed to contrast the durability of MLEC with comparable EC policies like LRC in my evaluations, making it vital to understand how these policies are implemented.&lt;/p>
&lt;p>Over the first week, I read several papers on different chunk placement policies for erasure coding, including LRC (Local Reconstruction Codes), CL-LRC (Combined Locality for Local Reconstruction Codes), SODP (Single Overlap Declustered Parity), and MLEC (Multi-Level Erasure Coding). These papers offered a fundamental comprehension of each policy, its respective advantages and drawbacks, and its practical usage in production environments.&lt;/p>
&lt;h2 id="simulator-reproduction">Simulator Reproduction&lt;/h2>
&lt;p>After gaining some understanding from the papers, I started to study the EC simulator by rebuilding it myself. I got the MLEC simulator from the mentors. However, the simulator lacks documentation and guides, making it hard for others to reproduce evaluation results. It is also complicated to understand, as it simulates various EC schemes, chunk placements, and rebuild policies across roughly 13,000 lines of code. Therefore, my goal is to understand the design and implementation details of the simulator, after which I will create guides for reproducible evaluations.&lt;/p>
&lt;p>The best way to fully understand the simulator was to rebuild it myself. The simulator is designed to mimic disk failures over the span of a year under varying chunk placement policies. Once successfully rebuilt, it will enable me to assess the durability of MLEC relative to other widely-used chunk placement policies. I followed the given simulator and rewrote it on my own in Python.&lt;/p>
&lt;p>Based on the skeleton of the given simulator, I first rebuilt a simple simulator that simulates SLEC (single-level erasure coding, in both local and network settings) with clustered parities. Given the arguments, the simulator can run an arbitrary number of iterations, each simulating disk failures over one year. It then collects the iterations in which there is a data loss; the ratio of failed iterations to total executed iterations estimates the probability of data loss, from which the durability of the erasure coding policy is derived. This simulation allows us to evaluate the durability of SLEC, laying the foundation for the later evaluation of MLEC.&lt;/p>
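To make the iteration loop concrete, here is a toy Monte-Carlo sketch of the failed-iterations counting (my own illustration with made-up parameters such as `afr`, not the actual simulator):

```python
import random

def estimate_loss_probability(iterations, disks=14, parity=4, afr=0.01, seed=42):
    """Toy sketch: each disk independently fails within the year with
    probability `afr`; a clustered stripe of `disks` disks loses data
    once more than `parity` of them have failed."""
    rng = random.Random(seed)
    failed_iterations = 0
    for _ in range(iterations):
        failures = sum(rng.random() < afr for _ in range(disks))
        if failures > parity:
            failed_iterations += 1
    # Ratio of failed iterations to total iterations: the estimated
    # probability of data loss, from which durability follows.
    return failed_iterations / iterations
```

A real run would model failure and repair times within the year rather than a single end-of-year draw; this only mirrors the counting logic.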
&lt;p>Next, I extended my simulator from local-level SLEC implementation by adding more policies. I began by introducing a network-level SLEC policy with clustered parities. This differs slightly from the local-level EC as it necessitates the consideration of factors like network bandwidth within the simulator.&lt;/p>
&lt;p>In addition, I have delved deeper into simulating declustered parities and successfully discovered a method to simulate disk failures. Basically, the simulator generates failures within a one-year timeframe and subsequently repairs them using priority queues. The disks associated with stripes experiencing the most failures are given the highest repair priority. With this construction, the simulator is capable of simulating local-level declustered parities, with the ability to specify parameters.&lt;/p>
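The priority ordering described above can be sketched with a heap (the stripe data here is hypothetical, not the simulator's actual structures):

```python
import heapq

# Hypothetical stripe state: (stripe id, number of failed chunks).
stripe_failures = [(0, 1), (1, 3), (2, 2)]

# Python's heapq is a min-heap, so we negate the failure count to pop
# the most-degraded stripe first.
repair_queue = []
for stripe_id, failed in stripe_failures:
    heapq.heappush(repair_queue, (-failed, stripe_id))

repair_order = []
while repair_queue:
    _, stripe_id = heapq.heappop(repair_queue)
    repair_order.append(stripe_id)
# Stripe 1 (3 failures) is repaired first, then stripe 2, then stripe 0.
```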
&lt;p>Upon successfully simulating local-level declustered parities, constructing the simulator for network-level declustered parities was rather straightforward. I then validated it using the simulator and math models provided by the mentors. The results agree perfectly, which confirms my understanding of the SLEC declustered placements. By implementing the simulator myself, I strengthened my understanding of erasure coding designs and simulation techniques, equipping me with a solid foundation for reproducing MLEC simulations.&lt;/p>
&lt;p>Building on the knowledge gained from implementing the SLEC simulators, I then reverse-engineered the MLEC simulator provided by the mentors from their MLEC paper. I chose to start from the simplest policy, clustered parities at both levels. After spending considerable time digging into the simulator source code, I was able to understand the simulation workflow, the different repair methods it implements, and the splitting method it uses to simulate high durabilities. I then revised my simulator based on this understanding. I also ran a few experiments using the same configuration setups specified in the paper. The results agree well with those in the paper, which verifies the success of my reproduction work.&lt;/p>
&lt;h2 id="technical-issues">Technical Issues&lt;/h2>
&lt;p>In the process of rebuilding the MLEC simulator, I&amp;rsquo;ve encountered many issues, both conceptual and technical. The mentors were super helpful and responsive throughout, so I was able to make steady progress.&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Overall, I&amp;rsquo;ve rebuilt a Python simulator for various EC policies, and it can successfully reproduce the results from the paper.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>My next step is to package the simulator into a Trovi artifact, so others can reproduce evaluations of the performance and durability of various EC policies, in particular MLEC.&lt;/p>
&lt;p>I am Shekhar and I am one of several students who are working on developing materials for reproducibility in machine learning education, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>. My &lt;a href="https://drive.google.com/file/d/1rCzLGIJ8HYCVjY_MfndgrQjAQa2SQbqZ/view?usp=sharing" target="_blank" rel="noopener">Proposal&lt;/a> aims to develop interactive educational materials about reproducibility in machine learning, for use in graduate and undergraduate classes. Our goal is to help students and researchers (1) understand some of the challenges they may face when trying to reproduce someone else&amp;rsquo;s published result, and (2) in their own publications, to specify the methodology so that the result will be more easily reproduced by others.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>My work is inspired by my participation in the &lt;a href="https://paperswithcode.com/rc2022" target="_blank" rel="noopener">2022 Machine Learning Reproducibility Challenge&lt;/a>, where I was reproducing a result related to bias in hate speech classifiers. The paper seemed at first to have complete methodology details. However, when I tried to implement their approach based on the description of the paper, I realized some important details were missing - for example, in the part where they replaced swear words in the text with other words having similar meaning. I wasn&amp;rsquo;t able to identify the exact list of swear words they used, or what approach they followed if the selected replacement was also a swear word. The choices I made when the authors&amp;rsquo; approach was left ambiguous had a significant impact on the magnitude of the final result.&lt;/p>
&lt;h2 id="milestones-and-accomplishments">Milestones and Accomplishments&lt;/h2>
&lt;p>To inform researchers and students about this problem, I created a fictitious machine learning research paper, and a sequence of accompanying Python notebooks to highlight various choices that can be made to fill in the gaps, and explore how these choices can impact the overall results of the research. Our &amp;ldquo;research paper&amp;rdquo; is about the impact of data augmentation on few-shot learning for intent classification. We implemented a basic data augmentation strategy with synonym replacement using the HWU64 dataset and a BERT classifier, and the results suggest that synonym replacement as a data augmentation technique leads to only minor improvement in accuracy.
In the fictitious paper, we left some of the methodology details ambiguous. When reproducing the results using the accompanying notebooks, the reader follows a &amp;ldquo;Choose Your Own Adventure&amp;rdquo; format, selecting a path through a tree, where each node represents ambiguous methodology details and branches out to different choices that are made at that instance. The leaf nodes will represent the final results, providing insights into the magnitude of the differences resulting from each node selection. Some of the choices that the reader makes are -&lt;/p>
&lt;ul>
&lt;li>what subset of the source dataset to use.&lt;/li>
&lt;li>some of the details of data pre-processing.&lt;/li>
&lt;li>some of the details of the synonym replacement data augmentation strategy.&lt;/li>
&lt;li>some training hyperparameters and the details of the hyperparameter search.&lt;/li>
&lt;/ul>
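As one concrete illustration of such an ambiguous choice, consider synonym replacement: which tokens are eligible, how many are replaced, and how a replacement is drawn are exactly the kinds of details a paper may leave unstated. A minimal sketch (a hypothetical helper, not the notebooks' actual code):

```python
import random

def synonym_replace(tokens, synonyms, n_replace=1, seed=0):
    """Replace up to `n_replace` eligible tokens with a synonym.
    Eligibility, the replacement count, and the sampling strategy are
    all choices a reproducer must make if the paper omits them."""
    rng = random.Random(seed)
    eligible = [i for i, tok in enumerate(tokens) if tok in synonyms]
    out = list(tokens)
    for i in rng.sample(eligible, min(n_replace, len(eligible))):
        out[i] = rng.choice(synonyms[tokens[i]])
    return out
```

Each branch of the "Choose Your Own Adventure" tree corresponds to a different way of filling in such gaps, and the notebooks let the reader compare the resulting accuracies.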
&lt;p>During the first phase of our project, we have implemented an initial draft of these notebooks, to explore various scenarios and see their impact on results. Next, we will further develop the interactive educational material around them.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>During the first half of the project, I faced two main challenges. First, I had to come up with a hypothetical research scenario that was realistic, yet easy for students without much expertise to understand. Attaining the right balance was essential to make it engaging and educational. The second challenge was to deliberately leave some details unclear in a realistic way while ensuring that the choices based on that ambiguity had a significant impact on the results. Fortunately, I had the guidance and support of my mentor, which allowed me to successfully tackle these challenges.&lt;/p>
&lt;p>Throughout this project, I faced various challenges and obstacles, but it turned out to be an incredible learning experience. I had the opportunity to dive deep into the domains of few-shot learning and meta-learning, which were entirely new to me. Moreover, I was able to find ambiguous methodologies present in academic papers and explore diverse scenarios related to them. Looking ahead, I am eager to continue working on this project throughout the summer, as it promises further learning and personal growth.&lt;/p></description></item><item><title>Midpoint Blog Interactive Exploration of High-dimensional Datasets with PolyPhy and Polyglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230803-kirandeol/</link><pubDate>Thu, 03 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230803-kirandeol/</guid><description>&lt;p>The last few months of my GSoC project have been very exciting and I hope to share why with you here in this blog post! To briefly summarize, my project has been focused on further developing the Polyglot app, a tool for visualizing 3D language embeddings. One important part of Polyglot is its utilization of the novel MCPM metric, where points are colored according to their MCPM similarity to a user-chosen “anchor point” (e.g., if “hat” is our anchor point, then similar words like “cap” or “fedora” will be colored more prominently).&lt;/p>
&lt;p>The first issue we wanted to tackle was actually navigating the point cloud. With hundreds of thousands of points, it can be difficult to find what you’re looking for! Thus, the first few features added were a search bar for points and anchor points and a “jump to point” feature which changes a user’s center of rotation and “jumps” to a chosen point. There were a few hiccups with implementing these features, mainly due to the large number of points and the particular quirks of the graphics library Polyglot uses. In the end though, these simple features made Polyglot feel a lot easier to use.&lt;/p>
&lt;p>The next set of features related to our desire to actually annotate the point cloud. Similar to how one might annotate a Google doc (i.e., highlight a chunk of text and leave a comment), we wanted to set up something similar, but with points! Indeed, this led to the development of a cool brush tool for coloring points, named and commented annotations (up to 5), a search bar within annotations, and finally a button to export annotations and comments to a CSV.&lt;/p>
&lt;p>The next few weeks are looking bright as we strive to finish the PolyPhy-Polyglot pipeline (a notebook for quickly formatting MCPM data from PolyPhy and getting it into Polyglot). We also hope to add a unique “timeline” feature in which users can analyze sections of the point cloud based on the associated time of each point. Overall, it’s been a very stimulating summer and I’m excited to push this project even further!&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230803-charishulu/</link><pubDate>Thu, 03 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230803-charishulu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, our goal was to characterize the tools in genomic workflows in terms of system metrics and data quality in order to build machine learning models that predict the elapsed time of genomic workflows. While Shayantan (another contributor) did the analysis on data quality metrics, I contributed the system metrics analysis. We are getting closer to that goal, having collected datasets and performed initial analysis.&lt;/p>
&lt;h2 id="steps">Steps&lt;/h2>
&lt;p>In this project, we selected the DNA-Seq Pipeline as the workflow to be analyzed. This pipeline consists of four tools for processing single-end reads: BWA-mem, Samtool-view, Picard-SortSam, and Picard-MarkDuplicates. We executed each tool with various configurations and stored system metrics for each execution. To do this, we took two steps:&lt;/p>
&lt;ul>
&lt;li>Step 1: Building the tools execution environment.&lt;/li>
&lt;li>Step 2: Developing a program to execute tools using some configurations and collect runtime parameters (e.g., CPU, RSS, VSZ, and IO) automatically.&lt;/li>
&lt;/ul>
&lt;h2 id="execution-environment">Execution Environment&lt;/h2>
&lt;p>Tools are executed on Chameleon instances by submitting them via Slurm. The machine used to collect system metrics is a Haswell instance on the Chameleon Texas server. This instance uses an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz with the following specifications.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>Number of CPUs&lt;/th>
&lt;td>48&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of threads per core&lt;/th>
&lt;td>2&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of cores per socket&lt;/th>
&lt;td>12&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of sockets&lt;/th>
&lt;td>2&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>In this experiment, we use n+1 instances: n compute nodes and 1 master node. Each execution is done by submitting a job (a tool with a certain configuration) from the master node, which is then processed by one of the compute nodes. For the tools to be executable on every node, we set up the master node to export a shared directory over NFS. This shared directory stores the input files and the commands for executing the tools, so all nodes can access them without having to download and install them locally.&lt;/p>
&lt;h2 id="executing-and-collecting-system-metrics">Executing and Collecting System Metrics&lt;/h2>
&lt;p>Tools are executed in specific configurations by varying parameters such as input size, CPU allocation, memory allocation, and thread count. For example, BWA-mem has 5, 4, and 5 variations of CPU allocation, memory allocation, and thread count respectively, applied to 10 different input files, giving 5 x 4 x 5 x 10 = 1000 configuration combinations. Each configuration is executed 8 times, yielding 8000 data points. Configuration details can be seen in the following table.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>#repetitions&lt;/th>
&lt;th>#files&lt;/th>
&lt;th>#allocated CPU&lt;/th>
&lt;th>#allocated memory&lt;/th>
&lt;th>#threads&lt;/th>
&lt;th>total&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>BWA-mem&lt;/th>
&lt;td>8&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Samtool-view&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Picard-Sortsam&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Picard-MarkDuplicates&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;/table>
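The sweep in the table can be enumerated directly; here is a sketch using the BWA-mem row (the SRA identifiers are placeholders, not our actual input files):

```python
from itertools import product

# Values from the BWA-mem row of the table above.
input_files = [f"SRR00{i}" for i in range(10)]  # placeholder SRA ids
allocated_cpus = [2, 4, 8, 16, 32]
allocated_memory_gb = [8, 16, 32, 64]
threads = [2, 4, 8, 16, 32]
repetitions = 8

configs = list(product(input_files, allocated_cpus, allocated_memory_gb, threads))
total_runs = len(configs) * repetitions
# 10 * 5 * 4 * 5 = 1000 configurations, executed 8 times each = 8000 runs
```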
&lt;p>Meanwhile, to run the tools, we use the following commands:&lt;/p>
&lt;ul>
&lt;li>BWA-mem&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">$BWA&lt;/span> mem -t &lt;span class="nv">$threads&lt;/span> &lt;span class="nv">$REF_DIR&lt;/span>/hg19.fa &lt;span class="si">${&lt;/span>&lt;span class="nv">INPUT_DIR&lt;/span>&lt;span class="si">}&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>*.fastq &amp;gt; &lt;span class="si">${&lt;/span>&lt;span class="nv">OUTPUT_DIR&lt;/span>&lt;span class="si">}&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.sam
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Samtool-view&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">$SAMTOOLS&lt;/span> view &lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.sam -Shb -o &lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Picard-SortSam&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">java -jar &lt;span class="nv">$PICARD&lt;/span> SortSam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">CREATE_INDEX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">INPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">OUTPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">SORT_ORDER&lt;/span>&lt;span class="o">=&lt;/span>coordinate &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">VALIDATION_STRINGENCY&lt;/span>&lt;span class="o">=&lt;/span>STRICT
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Picard-MarkDuplicates&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">java -jar &lt;span class="nv">$PICARD&lt;/span> MarkDuplicates &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">CREATE_INDEX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">INPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">OUTPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">METRICS_FILE&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>_rmd.txt &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">VALIDATION_STRINGENCY&lt;/span>&lt;span class="o">=&lt;/span>STRICT
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In Slurm, each job has a job ID, and the &lt;code>scontrol listpids&lt;/code> command shows the job-ID-to-PID mapping. Using this, we can obtain system metrics for a job by reading the &lt;code>/proc/$PID&lt;/code> filesystem entries, which expose CPU usage, physical memory, virtual memory, read bytes, and write bytes at a given moment. We record these features along with a timestamp at 1-second intervals throughout the execution.&lt;/p>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>We have also calculated the correlation of each feature with the elapsed time. For BWA-mem, the features with an absolute correlation above 0.5 are input size, average CPU usage, and output file size (SAM format). For Samtool-view they are input size, average CPU usage, and output size (BAM format).
For Picard-SortSam they are input size, write bytes, and BAM output size; for Picard-MarkDuplicates, input size and BAM output size.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>Features\Tools&lt;/th>
&lt;th>BWA-mem&lt;/th>
&lt;th>Samtool-view&lt;/th>
&lt;th>Picard-SortSam&lt;/th>
&lt;th>Picard-MarkDuplicates&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>Allocated CPU&lt;/th>
&lt;td>-0.145&lt;/td>
&lt;td>-0.095&lt;/td>
&lt;td>-0.179&lt;/td>
&lt;td>-0.156&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Allocated physical memory&lt;/th>
&lt;td>-0.010&lt;/td>
&lt;td>-0.038&lt;/td>
&lt;td>-0.069&lt;/td>
&lt;td>0.132&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Input size&lt;/th>
&lt;td>&lt;b>0.583&lt;/b>&lt;/td>
&lt;td>&lt;b>0.651&lt;/b>&lt;/td>
&lt;td>&lt;b>0.937&lt;/b>&lt;/td>
&lt;td>&lt;b>0.922&lt;/b>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Threads&lt;/th>
&lt;td>-0.072&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average CPU&lt;/th>
&lt;td>&lt;b>-0.607&lt;/b>&lt;/td>
&lt;td>&lt;b>-0.567&lt;/b>&lt;/td>
&lt;td>-0.479&lt;/td>
&lt;td>-0.480&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak CPU&lt;/th>
&lt;td>-0.175&lt;/td>
&lt;td>0.174&lt;/td>
&lt;td>-0.170&lt;/td>
&lt;td>0.046&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average RSS&lt;/th>
&lt;td>0.040&lt;/td>
&lt;td>0.034&lt;/td>
&lt;td>0.131&lt;/td>
&lt;td>0.182&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak RSS&lt;/th>
&lt;td>0.068&lt;/td>
&lt;td>0.046&lt;/td>
&lt;td>0.314&lt;/td>
&lt;td>0.175&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average VSZ&lt;/th>
&lt;td>0.032&lt;/td>
&lt;td>-0.349&lt;/td>
&lt;td>-0.127&lt;/td>
&lt;td>0.090&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak VSZ&lt;/th>
&lt;td>0.048&lt;/td>
&lt;td>0.074&lt;/td>
&lt;td>-0.130&lt;/td>
&lt;td>0.088&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Write bytes&lt;/th>
&lt;td>0.037&lt;/td>
&lt;td>0.190&lt;/td>
&lt;td>&lt;b>0.735&lt;/b>&lt;/td>
&lt;td>0.244&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Read bytes&lt;/th>
&lt;td>-0.031&lt;/td>
&lt;td>0.109&lt;/td>
&lt;td>0.070&lt;/td>
&lt;td>0.110&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output SAM size&lt;/th>
&lt;td>&lt;b>0.589&lt;/b>&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output BAM size&lt;/th>
&lt;td>-&lt;/td>
&lt;td>&lt;b>0.763&lt;/b>&lt;/td>
&lt;td>&lt;b>0.934&lt;/b>&lt;/td>
&lt;td>&lt;b>0.923&lt;/b>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output BAI size&lt;/th>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>0.400&lt;/td>
&lt;td>0.399&lt;/td>
&lt;/tr>
&lt;/table>
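The screening rule used above (keep features whose absolute correlation with elapsed time exceeds 0.5) can be expressed in a few lines; the numbers below are toy values, not our measurements:

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# Toy data standing in for per-run measurements.
elapsed = [10.0, 20.0, 30.0, 40.0]
features = {
    "input_size": [1.0, 2.0, 3.0, 4.0],  # strongly correlated
    "peak_cpu": [5.0, 3.0, 6.0, 4.0],    # essentially uncorrelated
}
selected = [name for name, xs in features.items()
            if abs(pearson(xs, elapsed)) > 0.5]
```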
&lt;h2 id="future-works">Future Works&lt;/h2>
&lt;p>For further work, we will analyze the correlation between elapsed time and the features whose absolute scores are below 0.5. These features may in fact correlate with elapsed time but appear uncorrelated because the correlations were computed over the whole dataset; we therefore also need to compute feature correlations for the data grouped by input file. Then, we will create a machine learning model to predict elapsed time.&lt;/p>
&lt;p>This is my second blog post for SoR 2023. As you may recall from my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230530-justin08784/">initial blogpost&lt;/a>, I am working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/">Flashnet&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>.&lt;/p>
&lt;p>I&amp;rsquo;ve been assigned two major tasks under Flashnet:&lt;/p>
&lt;ol>
&lt;li>Perform post-training quantization (PTQ) on existing Flashnet models&lt;/li>
&lt;li>Implement a rocksDB client (to interface with the Flashnet kernel) with 3-way replication&lt;/li>
&lt;/ol>
&lt;h2 id="task-1-perform-post-training-quantization-ptq-on-existing-flashnet-models">Task 1: Perform post-training quantization (PTQ) on existing Flashnet models&lt;/h2>
&lt;p>Since all of our models are currently built using the keras API, I decided to use the tensorflow-lite library, which supports direct conversion. Unfortunately, I encountered several persistent bugs while attempting to apply full-integer quantization on our binary neural network model:&lt;/p>
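For background, full-integer quantization maps float32 values onto the int8 range via a scale and zero-point. A self-contained sketch of that affine scheme (my own illustration of the general idea, not tflite internals):

```python
def quantize(values, num_bits=8):
    """Affine (scale/zero-point) quantization sketch: map float values
    onto the signed integer range, e.g. int8's [-128, 127].
    Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return ([max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values],
            scale, zero_point)

def dequantize(q, scale, zero_point):
    """Recover approximate float values; the quantization error is
    bounded by about half the scale."""
    return [(v - zero_point) * scale for v in q]
```

If, as suspected below, weights explode or vanish during conversion, the dequantized values will no longer round-trip like this, which is one way to spot the problem.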
&lt;h3 id="shapedimension-distortion">Shape/dimension distortion:&lt;/h3>
&lt;p>Bug description: The quantized tflite model produces outputs of shape (8, 1), the same as the input shape, when the original model produces single-value outputs (1, 1).&lt;/p>
&lt;p>Status: Resolved&lt;/p>
&lt;ul>
&lt;li>The original model has an input dimension of 8 for each input/x-value and there could be several inputs grouped in a single batch.&lt;/li>
&lt;li>The input/batch size is also determined implicitly in the normalization layer of the original model.&lt;/li>
&lt;li>However, the &amp;ldquo;interpreter&amp;rdquo; in the quantized model runs inference one input at a time, so the batch size needs to be explicitly set to 1, i.e. the shape of a single input, (1, 8).&lt;/li>
&lt;li>Doing so resolves the shape distortion.&lt;/li>
&lt;/ul>
&lt;h3 id="incorrect-y-value-range">Incorrect y-value range:&lt;/h3>
&lt;p>Bug description: There is no variation in the quantized model outputs (i.e., it produces the same value for every input row).&lt;/p>
&lt;p>In the original model, each inference output is a floating point value between 0 and 1. Outputs also vary according to input. This output is rounded towards 0 or 1 using a 0.5 standard cutoff (i.e. x &amp;gt; 0.5 → x = 1). Since the quantized model condenses 32-bit floats into 8-bit integers, we should expect a similar variation in output values across an 8-bit integer range.&lt;/p>
&lt;p>Printing the quantized model weights, I discovered that weight bursting/exploding gradients may occur during the quantization process, i.e. the weight values explode to infinity or vanish to 0 and therefore cannot carry any meaningful signal. The likely consequence is that the inference output always equals the bias matrix (since the Wx term in y = Wx + B gets zeroed out).&lt;/p>
&lt;p>Status: Open&lt;/p>
&lt;ul>
&lt;li>Multiple potential causes were considered, without any success:
&lt;ul>
&lt;li>Improper quantization of inputs/outputs&lt;/li>
&lt;li>Insufficient training time/number of epochs&lt;/li>
&lt;li>Incompatible model type/structure&lt;/li>
&lt;li>Incompatible tensorflow-lite version&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>At this point, I concluded that tensorflow-lite was too bug-ridden to make any further attempts with the library worthwhile.&lt;/li>
&lt;/ul>
&lt;h2 id="task-2-implement-a-rocksdb-client-to-interface-with-the-flashnet-kernel-with-3-way-replication">Task 2: Implement a rocksDB client (to interface with the Flashnet kernel) with 3-way replication&lt;/h2>
&lt;p>RocksDB is an embedded database for key-value data. Our Flashnet team is currently implementing a Flashnet client in Ceph, and they have tasked me with exploring an implementation in RocksDB as an alternative.&lt;/p>
&lt;p>I&amp;rsquo;ve started on this segment of the project only recently, so my current work is still in its formative stages. As of writing, I&amp;rsquo;ve been primarily occupied with setting up software (on a new Chameleon instance), running toy database examples, and familiarizing myself with basic terminology and the RocksDB documentation.&lt;/p>
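&lt;p>As a conceptual sketch of what 3-way replication involves, the toy class below writes each key to three stores and reads from the first replica holding the key. Plain dicts stand in for RocksDB instances; the class name and quorum rule are illustrative, not the actual Flashnet client design:&lt;/p>

```python
# Toy 3-way replicated key-value client. A real client would open
# three RocksDB databases; here in-memory dicts stand in for them.

class ReplicatedKV:
    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]

    def put(self, key, value):
        # Write to every replica. In this in-memory toy every write
        # succeeds; a real client would count successful replica
        # writes and require a majority before acknowledging.
        acks = 0
        for store in self.replicas:
            store[key] = value
            acks += 1
        return acks >= len(self.replicas) // 2 + 1

    def get(self, key):
        # Read from the first replica that holds the key.
        for store in self.replicas:
            if key in store:
                return store[key]
        return None

db = ReplicatedKV()
db.put(b"io_lat", b"123us")
print(db.get(b"io_lat"))  # b'123us'
```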
&lt;h2 id="future-work">Future work&lt;/h2>
&lt;p>I expect to continue working on Task 1 (implementing quantization from the ground up or using a different library) and Task 2 as detailed above. I also hope to implement a transformer-based model to supplement our existing suite of Flashnet models.&lt;/p></description></item><item><title>[Midterm] FlashNet: Towards Reproducible Continual Learning for Storage System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230807-rannnayy/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230807-rannnayy/</guid><description>&lt;h2 id="mid-term-report">Mid-Term Report&lt;/h2>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet">FlashNet&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1EhJm3kqrpybOkpXiiRMfqVxGeKe9iIsh/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a> and &lt;strong>Daniar Kurniawan&lt;/strong> aims to implement and optimize the FlashNet model in real-world storage systems using continual learning techniques. We focus on predicting I/O latency to decide whether or not an I/O should be failed over to another SSD. The following sections elaborate on the work description, major milestones achieved, accomplishments, and challenges during the first half of the summer.&lt;/p>
&lt;h2 id="work-description-major-milestones-achieved-and-accomplishments">Work Description, Major Milestones Achieved, and Accomplishments&lt;/h2>
&lt;p>For the first half of the summer, I implemented the continual learning pipeline for the model along with several drift detection algorithms, then evaluated their effectiveness. Below is a detailed description of each subtask.&lt;/p>
&lt;h3 id="1-continual-learning-pipeline">1. Continual Learning pipeline&lt;/h3>
&lt;p>Firstly, I designed the pipeline. As shown in the flowchart below, the pipeline contains four main modules, namely initial train, retrain, inference, and monitor.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pipeline Flowchart" srcset="
/report/osre23/uchicago/flashnet/20230807-rannnayy/cl-pipeline_hubf4c27ce042fa200bb9ef46ed6f9b5dd_194399_2067e763ad30087275106bc5b2921a5a.webp 400w,
/report/osre23/uchicago/flashnet/20230807-rannnayy/cl-pipeline_hubf4c27ce042fa200bb9ef46ed6f9b5dd_194399_fcd6d4a25c164fcfc872329662c36fa5.webp 760w,
/report/osre23/uchicago/flashnet/20230807-rannnayy/cl-pipeline_hubf4c27ce042fa200bb9ef46ed6f9b5dd_194399_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230807-rannnayy/cl-pipeline_hubf4c27ce042fa200bb9ef46ed6f9b5dd_194399_2067e763ad30087275106bc5b2921a5a.webp"
width="760"
height="249"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The modules were first developed in Python using a linear regression model. It turned out that linear regression was not good enough: it gave poor accuracy. To overcome this problem, I introduced more models and learning tasks.&lt;/p>
&lt;p>Hence, in the final implementation, we have random forest and neural network models for both the regression and classification tasks. These models outperform linear regression, and the pipeline has also been optimized.&lt;/p>
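&lt;p>The four modules can be sketched as a simple loop. The following is a toy illustration, not the actual pipeline code: a trivial mean predictor stands in for our random forest / neural network models, and the drift check is a placeholder threshold rule:&lt;/p>

```python
# Skeleton of the four-module loop: initial train -> inference ->
# monitor -> retrain. The "model" just predicts the mean latency
# of its training history; real models are far richer.

def train(history):
    mean = sum(history) / len(history)
    return lambda x: mean  # predict the historical mean latency

def drift_detected(window, history, threshold=2.0):
    # Monitor: flag drift when the recent mean latency deviates
    # strongly from the training-history mean (placeholder rule).
    return abs(sum(window) / len(window) - sum(history) / len(history)) > threshold

history = [1.0, 1.2, 0.9, 1.1]          # latencies seen so far
model = train(history)                   # initial train
incoming = [[1.0, 1.1], [5.0, 5.2]]      # batches of new I/O latencies
for window in incoming:
    preds = [model(x) for x in window]   # inference on the new batch
    if drift_detected(window, history):  # monitor
        history += window
        model = train(history)           # retrain on extended history
print(round(model(None), 2))             # 2.4 after the drifted batch
```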
&lt;h3 id="2-drift-detection-algorithms">2. Drift detection algorithms&lt;/h3>
&lt;p>Sometimes, the built model&amp;rsquo;s performance may degrade when it faces recent I/Os with different characteristics than those it was trained on. Hence, the model must be retrained, and the retraining must be triggered somehow. The trigger could be as simple as a periodic schedule, or a technique called drift detection. Retraining too often incurs a large computational overhead, while retraining too seldom causes performance degradation. Hence, we should build a good and reliable drift detection algorithm that can sense the presence of concept and covariate drift in recent data.&lt;/p>
&lt;p>To build a good algorithm, I first used heuristics derived from an understanding of how latency and throughput change over time. However, the results turned out not to be very good. Thus, I&amp;rsquo;ve been relying on statistical tests as the drift detector. So far, the Kolmogorov-Smirnov test, commonly known as the KS-test, is the best drift detector.&lt;/p>
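&lt;p>For intuition, the two-sample KS statistic is simply the maximum gap between the two samples&amp;rsquo; empirical CDFs. A small pure-Python sketch (in the actual pipeline a library implementation would be used, which also provides a p-value):&lt;/p>

```python
import bisect

# Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
# difference between the empirical CDFs of two samples.

def ks_statistic(a, b):
    a, b = sorted(a), sorted(b)

    def ecdf(sample, x):
        # Fraction of the sample that is at most x.
        return bisect.bisect_right(sample, x) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

same = [1, 2, 3, 4, 5]
shifted = [11, 12, 13, 14, 15]      # e.g. latencies after a workload change
print(ks_statistic(same, same))     # 0.0
print(ks_statistic(same, shifted))  # 1.0 -> clear drift
```

A drift detector would compare recent I/O latencies against the training sample and trigger retraining when the statistic exceeds a chosen threshold.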
&lt;h3 id="3-evaluation">3. Evaluation&lt;/h3>
&lt;p>The featured image in the headline of this blog, also shown below, is the result of the evaluation. I evaluated the models and drift detection algorithms using Cumulative Distribution Function (CDF) graphs, to see whether the latency tail is cut.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Evaluation" srcset="
/report/osre23/uchicago/flashnet/20230807-rannnayy/featured_hua13ad1b86612ea35a1f0d083114566fc_25432_4866e846612d96725d801519edf06392.webp 400w,
/report/osre23/uchicago/flashnet/20230807-rannnayy/featured_hua13ad1b86612ea35a1f0d083114566fc_25432_9203cd36fc4c6de03e02a799cd564f1d.webp 760w,
/report/osre23/uchicago/flashnet/20230807-rannnayy/featured_hua13ad1b86612ea35a1f0d083114566fc_25432_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230807-rannnayy/featured_hua13ad1b86612ea35a1f0d083114566fc_25432_4866e846612d96725d801519edf06392.webp"
width="760"
height="396"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>During the implementation, I encountered the following challenges.&lt;/p>
&lt;h3 id="1-choice-of-model">1. Choice of Model&lt;/h3>
&lt;p>Since we want to integrate the pipeline into real storage systems, we had to be mindful of model choice. Classical machine learning models are lighter than deep learning models. However, deep learning models offer higher accuracy and are thus preferable. Hence, I implemented both and examined the effectiveness of each.&lt;/p>
&lt;h3 id="2-choice-of-drift-detection-algorithm">2. Choice of Drift Detection Algorithm&lt;/h3>
&lt;p>The continual learning technique chosen for this task may require the model to be retrained, since the workload may change over time. The implication is that we need a condition that triggers the retraining. As training a model is costly, we need to retrain it mindfully. Thus, we use a drift detection algorithm to detect whether or not retraining is needed.&lt;/p>
&lt;p>There are two types of drift detection algorithms, namely statistical tests and model-based drift detection. To minimize overhead, we picked statistical tests. There are various algorithms to choose from; I picked five of them to implement and evaluate.&lt;/p>
&lt;h2 id="plan">Plan&lt;/h2>
&lt;p>For the second half of the summer, I am going to study Riak and create a Chameleon Trovi artifact for deploying Riak in a cluster.&lt;/p></description></item><item><title>Introducing Levels of Reproduction and Replication in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230802-msaeed/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230802-msaeed/</guid><description>&lt;p>Hello again,&lt;/p>
&lt;p>I am Mohamed Saeed and this is my second blog post for the 2023 Summer of Reproducibility Fellowship. As you may recall from my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230601-msaeed">previous post&lt;/a>, I am working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> as my mentor. My goal is to create interactive open educational resources that teach reproducibility and reproducible research in machine learning (ML) as I &lt;a href="https://drive.google.com/file/d/13HnCMZawpabiLdBoOiaJFF2mNXIPLCVJ/view?usp=sharing" target="_blank" rel="noopener">proposed&lt;/a>.&lt;/p>
&lt;p>In this post, I will share with you some of the progress I have made so far, as well as some of the challenges I have faced and how I overcame them. I will also highlight some of the specific accomplishments that I am proud of and what I plan to do next.&lt;/p>
&lt;h2 id="reproducing-on-warm-starting-neural-network-training">Reproducing &amp;ldquo;On Warm Starting Neural Network Training&amp;rdquo;&lt;/h2>
&lt;p>This material is a reproduction of the paper &lt;a href="https://arxiv.org/abs/1910.08475" target="_blank" rel="noopener">&amp;ldquo;On Warm Starting Neural Network Training&amp;rdquo;&lt;/a> by Jordan T. Ash and Ryan P. Adams (2020). This paper investigates the effect of warm-starting neural networks, which means using the weights of previous models trained on a subset of the data, to train on a new dataset that has more data.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20230802-msaeed/warm_start_huf40f540ab6672b609385b58179d23d2a_3423296_0c5af6e4428dce728fe7a643b2b8e6d3.webp 400w,
/report/osre23/nyu/eduml/20230802-msaeed/warm_start_huf40f540ab6672b609385b58179d23d2a_3423296_f3e332c8b81d6d3146e54527a273bbfe.webp 760w,
/report/osre23/nyu/eduml/20230802-msaeed/warm_start_huf40f540ab6672b609385b58179d23d2a_3423296_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230802-msaeed/warm_start_huf40f540ab6672b609385b58179d23d2a_3423296_0c5af6e4428dce728fe7a643b2b8e6d3.webp"
width="760"
height="383"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
The figure illustrates how the new model uses the weights from the previous model as its initial values. This allows the new model to train on both the “Original” data, which it has already seen, and the new data, which it has not encountered before. In contrast, the randomly initialized model treats the entire data as unfamiliar and starts from scratch.&lt;/p>
&lt;p>The paper also shows that this method can lead to lower test accuracy than starting from scratch with random weights, even though the training loss is similar. It then proposes a simple way to improve the test accuracy of warm-starting: shrink the previous weights and add some noise to them.&lt;/p>
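&lt;p>A minimal sketch of this &amp;ldquo;shrink and perturb&amp;rdquo; style remedy applied to a flat list of weights (the lam and sigma values below are illustrative, not the paper&amp;rsquo;s tuned hyperparameters):&lt;/p>

```python
import random

# Shrink the warm-start weights toward zero and add a little
# Gaussian noise before continuing training on the larger dataset.

def shrink_perturb(weights, lam=0.6, sigma=0.01, seed=0):
    rng = random.Random(seed)
    return [lam * w + rng.gauss(0.0, sigma) for w in weights]

old_weights = [0.8, -1.2, 0.3]   # weights from the model trained on the subset
new_init = shrink_perturb(old_weights)
print([round(w, 2) for w in new_init])
```

Each new initial weight stays close to a scaled-down copy of the old one, so the network keeps some of what it learned while regaining enough plasticity to fit the new data.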
&lt;p>To reproduce this paper, I followed a systematic approach that ensured reliable results. This approach involved:&lt;/p>
&lt;ul>
&lt;li>Reading the paper and its main claims carefully.&lt;/li>
&lt;li>Finding out what resources the authors shared, such as code, data, and models.&lt;/li>
&lt;li>Looking for additional materials online that could help me save time and fill in the gaps left by the authors.&lt;/li>
&lt;li>Setting up the environment and dependencies needed to run the code smoothly.&lt;/li>
&lt;li>Writing code and updating any outdated functions that might cause errors.&lt;/li>
&lt;li>Running the code and verifying that it matched the results reported in the paper.&lt;/li>
&lt;li>Analyzing and interpreting the results and comparing them with the paper’s findings.&lt;/li>
&lt;/ul>
&lt;p>I used &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a> as my platform for running and documenting my reproduction experiments. Chameleon is a large-scale, reconfigurable experimental platform that supports computer science systems research. It allows users to create and share Jupyter notebooks that can run Python code on Chameleon’s cloud servers.&lt;/p>
&lt;p>I created a &lt;a href="https://github.com/mohammed183/re_warm_start_nn" target="_blank" rel="noopener">GitHub repository&lt;/a> where you can find everything related to my reproduction work in the form of interactive Jupyter notebooks that will help you learn more about machine learning and the reproducibility of machine learning research.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>Reproducing a paper is not an easy task. I faced several challenges along the way. One of the biggest challenges was the lack of code and pretrained models from the authors. This is a common problem for many reproducibility projects. Fortunately, I found a previous reproducibility publication for this paper on &lt;a href="https://rescience.github.io/bibliography/Kireev_2021.html" target="_blank" rel="noopener">ReScience journal&lt;/a>. I used some of their code and added some new functions and modifications to match the original paper’s descriptions. I also encountered other challenges that I discussed in the notebooks with the solutions that I applied.&lt;/p>
&lt;h2 id="how-to-use-this-material">How to use this material?&lt;/h2>
&lt;p>This material is a series of notebooks that walk you through the paper and its claims, experiments, and results. You will learn how to analyze, explain, and validate the authors’ claims. To get started, I suggest you skim the &lt;a href="https://arxiv.org/abs/1910.08475" target="_blank" rel="noopener">original paper&lt;/a> briefly to get the main idea and the public information. This will help you understand how the authors could have been more clear and transparent in some sections. I have given clear instructions and explanations in the notebooks, as well as how I dealt with the missing components. You can use this material for self-learning or as an assignment by hiding the final explanation notebook.&lt;/p>
&lt;h2 id="conclusion-and-future-work">Conclusion and Future Work&lt;/h2>
&lt;p>In this blog post, I have shared with you some of my work on reproducing warm starting neural network training. I have learned a lot from this experience and gained a deeper understanding of reproducibility and reproducible research principles in ML.&lt;/p>
&lt;p>I am very happy with what I have achieved so far, but I still have more work to do. I am working on reproducing the &lt;a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noopener">Vision Transformer: An Image is Worth 16x16 Words&lt;/a> paper by Alexey Dosovitskiy et al. This time my approach is to use the available pretrained models provided by the authors to verify the claims made in the paper. However, there are some challenges that I face in reproducing the paper. For example, some of the datasets and code that the authors used are not publicly available, which makes it hard to replicate their experiments exactly. These challenges are common in reproducing research papers, especially in computer vision. Therefore, it is important to learn how to deal with them and find ways to validate some of the claims.&lt;/p>
&lt;p>I hope you enjoyed reading this blog post and found it informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and stay tuned for more updates!&lt;/p></description></item><item><title>Midterm Blog Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230802-ren.450/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230802-ren.450/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings">Measuring Research Prototypes under Unreported Settings&lt;/a> my &lt;a href="https://drive.google.com/file/d/1ouFre-qMDCL_LiH5jFNUCOI1yAYHdWcS/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/miao-yu/">Miao YU&lt;/a> aims to understand the impact of missing settings in artifact evaluation.&lt;/p>
&lt;p>Based on our project proposal, the first step is to test the benchmark application on targeted systems. We picked the open-source database system PostgreSQL as the target system, and we test the TPC-C benchmark on it under default settings. We measure the throughput of the benchmark by setting the scalefactor to 10 and incrementing the number of worker terminals. The database server settings are all default values; we take these results as the baseline. In order to test more parameters and system settings, we need to choose a combination of parameters that gives optimal throughput.&lt;/p>
&lt;p>We use an online tool, &lt;a href="https://pgtune.leopard.in.ua/#/" target="_blank" rel="noopener">PGTune&lt;/a>, which tunes the PostgreSQL configuration based on the hardware. We select shared_buffers, min/max_wal_size and effective_cache_size as the first set of parameters to measure. They are related to memory consumption, checkpoints and planner cost in the database server. Based on the PostgreSQL &lt;a href="https://www.postgresql.org/docs/current/runtime-config.html" target="_blank" rel="noopener">official documentation&lt;/a>, shared_buffers sets the amount of memory the database server uses for shared memory buffers. Max_wal_size sets the maximum size to let the WAL grow during automatic checkpoints. Larger settings for shared_buffers usually require a corresponding increase in max_wal_size, in order to spread out the process of writing large quantities of new or changed data over a longer period of time. Effective_cache_size sets the planner&amp;rsquo;s assumption about the effective size of the disk cache that is available to a single query. This is factored into estimates of the cost of using an index; a higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used.&lt;/p>
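&lt;p>For illustration, such a tuned configuration ends up as a few lines in postgresql.conf. The values below are examples of what PGTune might suggest for a machine with 16 GB of RAM, not our exact measured settings:&lt;/p>

```ini
# Illustrative postgresql.conf fragment (example values for a
# 16 GB server; actual values depend on the hardware):
shared_buffers = 4GB            # memory for the shared buffer cache
min_wal_size = 2GB
max_wal_size = 8GB              # let the WAL grow between checkpoints
effective_cache_size = 12GB     # planner's estimate of the OS disk cache
```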
&lt;p>We conduct the experiments by setting the parameters in increments and comparing the throughput with each other and with the baseline. Based on the results, the throughput of the benchmark with larger shared_buffers and max_wal_size is up to 1.5X the performance under default settings. The improvement from tuning max_wal_size is larger than that from tuning shared_buffers. An increased effective_cache_size has no effect for this benchmark workload compared to the system&amp;rsquo;s default value.&lt;/p>
&lt;p>There are more values of the above-mentioned parameters to test. Next, I will test those parameters with incremented values. Furthermore, we need to choose a combination of more parameters to get optimal throughput. Also, by its own description, the tuning tool may not generate optimal values for very high memory systems. This requires us to test more possible parameters and their values for better performance.&lt;/p></description></item><item><title>Midterm: High Fidelity UAV Simulation Using Unreal Engine with specular reflections</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230802-damodardatta/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230802-damodardatta/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc">Open Source Autonomous Vehicle Controller&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/18g-WRZj_7ufIt6YZNn4OG1s7VKi1u5hV/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Aaron Hunter and Carlos Espinosa&lt;/strong> aims to develop an Unreal Engine based simulator for testing. The simulator will use Unreal Engine for physics and visualization.&lt;/p>
&lt;h2 id="what-we-have-done-so-far">What we have done so far&lt;/h2>
&lt;ul>
&lt;li>We found that we can use Unreal Engine as a physics simulator and co-simulate with Simulink using the tools provided by MathWorks.&lt;/li>
&lt;li>We simulated an example provided by MathWorks, but it did not show the expected behaviour and there were very few resources available.&lt;/li>
&lt;li>So we decided to use Gazebo and ROS for simulation instead of Unreal Engine and Simulink, starting with the example of a balancing bot that had been designed in SolidWorks.&lt;/li>
&lt;li>To use Gazebo, I converted the SolidWorks model into a URDF and imported it into Gazebo.&lt;/li>
&lt;/ul>
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;p>Currently, I am working on using Gazebo and ROS to control a balancing bot with a PID control algorithm. Afterwards, I will document the process of importing a model into Gazebo for testing a control algorithm.&lt;/p></description></item><item><title>ScaleBugs: Reproducible Scalability Bugs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230802-boluwarinayinmode/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230802-boluwarinayinmode/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As part of the Scalebugs Project, we have worked on building a dataset of reproducible scalability bugs. To achieve this, we go through existing bug reports for popular distributed systems, which include Cassandra, HDFS, Ignite, and Kafka. Workloads are designed to reproduce these scalability bugs by triggering some functionalities of the system under different configurations (e.g., different numbers of nodes), for which we will observe the impact on performance.&lt;/p>
&lt;p>So far we have worked on packaging the buggy and fixed versions of scalability systems, a runtime environment that ensures reproducibility, and the workloads used to trigger the symptoms of the bug inside docker containers. By packaging these versions together, we are simplifying the process of deployment and testing. This enables us to switch between different versions efficiently, aiding in the identification and comparison of the bug&amp;rsquo;s behavior. For each scalability system, we have carefully built a runtime environment that is consistent and reproducible. This approach ensures that each time we run tests or investigations, the conditions remain identical.&lt;/p>
&lt;h2 id="new-terms">New Terms&lt;/h2>
&lt;p>In order to make sense of the various bug reports, we had to learn some terminologies associated with scalability systems:&lt;/p>
&lt;p>&lt;strong>Clusters&lt;/strong>: Clusters are groups of related or connected items, often found in various fields such as computer science, data analysis, or even social sciences. For example, in data analysis, clusters might represent groups of data points with similar characteristics, making it easier to understand patterns or trends in the data.&lt;/p>
&lt;p>&lt;strong>Cluster Membership&lt;/strong>: Cluster membership refers to the process of determining which items or entities belong to a particular cluster. This task can be done based on various criteria, such as similarity in attributes, spatial proximity, or shared characteristics.&lt;/p>
&lt;p>&lt;strong>Locks&lt;/strong>: In computer programming, locks are mechanisms used to manage access to shared resources, such as files, data structures, or hardware devices. When multiple processes or threads need to access a shared resource simultaneously, locks ensure that only one process or thread can access it at a time, preventing data corruption or conflicts.&lt;/p>
&lt;p>&lt;strong>Lock Contentions&lt;/strong>: Lock contention occurs when multiple processes or threads attempt to acquire the same lock simultaneously. When this happens, one process or thread must wait until the lock becomes available, leading to potential delays and reduced performance.&lt;/p>
&lt;p>&lt;strong>Critical Paths&lt;/strong>: In project management or process analysis, a critical path is the longest chain of dependent tasks that determines the overall duration of the project or process. Any delay in tasks along the critical path will directly impact the project&amp;rsquo;s completion time.&lt;/p>
&lt;p>&lt;strong>Tokens&lt;/strong>: Tokens can have various meanings depending on the context. In computer programming, tokens are the smallest units of source code recognized by a compiler or interpreter. In cryptography, tokens can represent digital certificates or authentication data used for secure communication.&lt;/p>
&lt;p>&lt;strong>Nodes&lt;/strong>: In the context of network theory or graph theory, nodes are individual points or entities that form a network or graph. In a computer network, nodes can be devices like computers or routers, and in a social network, nodes can represent individuals or entities.&lt;/p>
&lt;p>&lt;strong>Peers&lt;/strong>: Peers are entities within a network that have the same status or capabilities. In peer-to-peer networks, each node can act as both a client and a server, enabling direct communication between nodes without relying on a central server.&lt;/p>
&lt;p>&lt;strong>Gossipers, Gossip Protocol&lt;/strong>: In distributed systems, gossipers are nodes that share information with each other using the gossip protocol. The gossip protocol involves randomly selecting peers and exchanging information in a decentralized manner, allowing information to spread quickly across the network.&lt;/p>
&lt;p>&lt;strong>Threads&lt;/strong>: Threads are the smallest units of execution within a process in computer programming. Multiple threads can run concurrently within a single process, enabling multitasking and parallel processing. Threads can share the same resources within the process, making them more lightweight than separate processes. However, proper synchronization is essential to prevent data corruption or conflicts when multiple threads access shared resources.&lt;/p>
&lt;p>&lt;strong>Flush and Writes Contention&lt;/strong>: This refers to a situation where simultaneous operations involving data flushing (saving data to a storage medium) and data writing (updating or adding data) are causing conflicts or delays. This contention can arise when multiple processes or threads attempt to perform these operations concurrently, leading to performance bottlenecks or potential data integrity issues.&lt;/p>
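&lt;p>The lock and lock-contention terms above can be made concrete with a small, hypothetical Python example: three threads contend for one lock, so their critical sections are serialized even though the threads start together:&lt;/p>

```python
import threading
import time

# Toy demonstration of lock contention: work that could overlap is
# serialized because every thread must hold the same lock.

lock = threading.Lock()
order = []

def worker(name):
    with lock:            # only one thread may hold the lock at a time
        order.append(name)
        time.sleep(0.01)  # simulate work done under the lock

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# The three critical sections ran back to back, so the total time is
# roughly the sum of the three sleeps, not the time of one.
print(len(order), round(elapsed, 3))
```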
&lt;h2 id="accomplishments">Accomplishments&lt;/h2>
&lt;p>We have been able to build docker containers for the following scalability bugs:&lt;/p>
&lt;p>&lt;strong>IGNITE 12087&lt;/strong>&lt;/p>
&lt;p>This bug stems from the resolution of the IGNITE-5227 issue (another bug), which led to a significant decline in the performance of a particular operation. Before IGNITE-5227 was addressed, inserting 30,000 entries was remarkably efficient, completing in roughly 1 second. After the resolution, the same insertion of 30,000 entries slowed considerably, taking approximately 130 seconds, a performance degradation of nearly 100 times.&lt;/p>
&lt;p>&lt;strong>CASSANDRA 14660&lt;/strong>&lt;/p>
&lt;p>This bug is related to how clusters work together and how a lock is causing conflicts with the critical path. The issue arises from a method call that uses O(Peers * Tokens) resources while contending for a lock, which is causing problems in the write path. The lock is used to protect cached tokens that are essential for determining the correct replicas. The lock is implemented as a synchronized block in the TokenMetadata class.&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>It was fixed by reducing the complexity of the operation to O(Peers) taking advantage of some properties of the token list and the data structure.&lt;/p>
&lt;p>&lt;strong>CASSANDRA 12281&lt;/strong>&lt;/p>
&lt;p>This bug is also related to how clusters work together and a lock conflict. The issue arises when a specific method accesses a lot of resources (O(Tokens^2)) while contending for a read lock. As reported, a cluster with around 300 nodes has around 300 * 256 tokens (assuming the default number of tokens), so joining a new member reportedly takes more than 30 minutes. Due to the long execution time, this lock delays every gossip message, so the node never becomes active.&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>The granularity of the lock was decreased: the expensive function calls no longer take the problematic read lock and instead use a synchronized block on a specific field, which does the job much better.&lt;/p>
&lt;p>&lt;strong>HA16850&lt;/strong>&lt;/p>
&lt;p>This is a bug related to obtaining thread information in the JvmMetrics package. The original buggy version used MXBeans to obtain thread information. That call uses an underlying native implementation that holds a lock on threads, preventing thread termination or creation. This means that the more threads we have to obtain information for, the longer the function call holds the lock. The result is that the execution time scales with the number of active threads, O(threads).&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>Developers utilized a ThreadGroup to keep track of obtaining metrics for threads. The result is that there is no lock held for every thread.&lt;/p>
&lt;p>&lt;strong>CA13923&lt;/strong>&lt;/p>
&lt;p>This issue revolves around conflicts between the &amp;ldquo;flush&amp;rdquo; and &amp;ldquo;writes&amp;rdquo; processes. The main problem is that during the &amp;ldquo;flush&amp;rdquo; process, a resource-intensive function called &amp;ldquo;getAddressRanges&amp;rdquo; is invoked. This function has a high computational cost and its complexity is O(Tokens^2). In other words, the time it takes to complete this function grows quickly as the number of &amp;ldquo;tokens&amp;rdquo; increases. This situation is causing challenges and delays in the overall process.&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>This function call affected many paths and they made sure no one calls getAddressRanges in critical paths.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>&lt;strong>Demanding Memory Requirements&lt;/strong>: Running certain builds consumes a significant amount of memory. This places a strain on system resources and can impact the overall performance and stability of the process.&lt;/p>
&lt;p>&lt;strong>Little Issues Impacting Execution&lt;/strong>: Often, seemingly minor details can obstruct the successful execution of a build. Resolving such issues requires thorough investigation and extensive research into similar problems faced by others in the past.&lt;/p>
&lt;p>&lt;strong>Complexities of Scalability Bugs&lt;/strong>: Identifying the underlying causes of scalability-related bugs is intricate. These bugs exhibit unique characteristics that can complicate the process of pinpointing and comprehending their root origins.&lt;/p>
&lt;h2 id="what-is-docker--for-those-who-dont-know-about-it-">What is Docker? ( For those who don&amp;rsquo;t know about it )&lt;/h2>
&lt;p>Docker is a platform that facilitates the containerization of applications, leading to consistent and efficient deployment across diverse environments. Its benefits include portability, resource efficiency, isolation, and rapid development cycles. DockerHub complements Docker by providing a centralized hub for sharing and accessing container images, fostering collaboration and ease of use within the Docker ecosystem.&lt;/p>
&lt;p>More about docker &lt;a href="https://docs.docker.com/get-started/overview/" target="_blank" rel="noopener">https://docs.docker.com/get-started/overview/&lt;/a>&lt;/p></description></item><item><title>Mid-term blog post for Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230801-srishti-j18/</link><pubDate>Tue, 01 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230801-srishti-j18/</guid><description>&lt;p>Hello!&lt;/p>
&lt;p>I am Srishti Jaiswal and this is my second blog post for the 2023 Summer of Reproducibility Fellowship.&lt;/p>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As I reach the halfway mark of my internship journey, I have had the incredible opportunity to work on a project that revolves around reproducing an adaptive video research result using cloud-based experimentation. This blog post delves into my work so far: the significant milestones achieved, specific accomplishments to celebrate, and the challenges overcome. Utilizing CloudLab and FABRIC, I embarked on a journey to reproduce essential figures from the research paper &lt;a href="https://dl.acm.org/doi/10.1145/2491172.2491179" target="_blank" rel="noopener">Downton Abbey Without the Hiccups: Buffer-Based Rate Adaptation for HTTP Video Streaming&lt;/a>, ensure Python 2 and Python 3 compatibility, and incorporate an Estimated Download Rate column in the log file produced by the video client. Let&amp;rsquo;s explore the details of this internship experience.&lt;/p>
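&lt;p>For readers unfamiliar with the paper, its core idea can be illustrated with a minimal sketch. This is not the paper&amp;rsquo;s or the client&amp;rsquo;s actual code, and the rate ladder and thresholds below are invented: the next bitrate is chosen purely from buffer occupancy, ramping linearly between a &amp;ldquo;reservoir&amp;rdquo; and a &amp;ldquo;cushion&amp;rdquo; of buffered video.&lt;/p>

```python
# Buffer-based rate selection, sketched with illustrative constants.
RATES = [350, 600, 1000, 2000, 3000]  # available bitrates in kbps (invented)
RESERVOIR, CUSHION = 10, 50           # buffer thresholds in seconds (invented)

def choose_rate(buffer_seconds):
    """Map current buffer occupancy to a bitrate from RATES."""
    if buffer_seconds <= RESERVOIR:   # buffer nearly empty: play it safe
        return RATES[0]
    if buffer_seconds >= CUSHION:     # buffer comfortable: use the top rate
        return RATES[-1]
    # In between, ramp linearly and pick the highest rate under the target.
    frac = (buffer_seconds - RESERVOIR) / (CUSHION - RESERVOIR)
    target = RATES[0] + frac * (RATES[-1] - RATES[0])
    return max(r for r in RATES if r <= target)
```

&lt;p>With these constants, a nearly empty buffer yields the lowest rate, a full one the highest, and intermediate occupancies pick a rate along the linear ramp.&lt;/p>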
&lt;h2 id="major-milestones-reached">Major Milestones Reached&lt;/h2>
&lt;p>Here are the milestones we have reached so far:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Familiarity with CloudLab and FABRIC testbeds: I learned how to run an adaptive video experiment, which is the jumping-off point for my project, on the CloudLab and FABRIC platforms.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Python 2 and Python 3 compatibility: My first task was to port an existing open-source code base developed for Python 2 (which is no longer supported) so that it can run in Python 3.
The code now runs successfully under both versions for all the policies in the existing open-source client, i.e., Basic, Netflix, and Sara.
This fixed &lt;a href="https://github.com/Srishti-j18/AStream/issues/1" target="_blank" rel="noopener">issue #1&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Estimated Download Rate for Basic Policy: To make it easier for users to understand and visualize how the adaptive video policy works, I added an additional metric, “Estimated Download Rate”, to the output file produced by the adaptive video client.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Graphing Buffer Occupancy and Estimated Download Rate: I extended the existing experiment to show two additional visualizations that are important for understanding how the adaptive video client works: buffer occupancy vs time and estimated download rate vs time.&lt;/p>
&lt;/li>
&lt;/ol>
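&lt;p>To give a flavor of the &amp;ldquo;Estimated Download Rate&amp;rdquo; metric from milestone 3, here is a hypothetical sketch (the function name and smoothing weight are my own, not the client&amp;rsquo;s actual code): the estimated rate for a segment is its size divided by its download time, optionally smoothed across segments.&lt;/p>

```python
def estimated_download_rate(segment_bits, download_seconds,
                            prev_estimate=None, weight=0.8):
    """Estimated download rate in bits/second, smoothed across segments."""
    instantaneous = segment_bits / download_seconds
    if prev_estimate is None:
        return instantaneous
    # Exponential smoothing so a single slow segment does not dominate.
    return weight * prev_estimate + (1 - weight) * instantaneous

first = estimated_download_rate(4_000_000, 2.0)          # a 2 Mbps download
second = estimated_download_rate(4_000_000, 1.0, first)  # smoothed toward 4 Mbps
```

&lt;p>Logging this value per segment makes it easy to plot the estimated rate against time, as done in milestone 4.&lt;/p>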
&lt;h2 id="overcoming-challenges">Overcoming Challenges&lt;/h2>
&lt;p>I encountered several challenges throughout this project, especially as it was my first time working independently on a research paper as a third-year engineering student. However, with my mentor&amp;rsquo;s guidance and support, I persevered and learned to tackle each obstacle with determination.&lt;/p>
&lt;p>One significant challenge was porting the entire code from Python2 to Python3. This transition resulted in numerous errors, and I often found it challenging to pinpoint where the mistakes occurred. To overcome this, I adopted a step-by-step approach, fixing errors one by one and verifying them using Python2 for comparison.&lt;/p>
&lt;p>Understanding the complex codebase was another hurdle that led to moments of feeling stuck in an infinite loop. But every time I faced such situations, I sought my mentor&amp;rsquo;s advice, and together, we made strategic changes to overcome these challenges.&lt;/p>
&lt;p>I am immensely grateful for my mentor&amp;rsquo;s expertise and support throughout this internship. Her guidance played a crucial role in helping me navigate through the challenges and grow both professionally and personally. I eagerly look forward to the rest of the journey, knowing that I can continue making meaningful contributions to this research project with her inspiring mentorship.&lt;/p>
&lt;h2 id="future-prospects">Future Prospects&lt;/h2>
&lt;p>As the second half of my internship approaches, I am eager to further refine and expand our experimentation.
Our main aim is to reproduce the existing work and provide a clear guide for other students to do the same; for this, I have to create a framework that helps them improve and build upon this work.&lt;/p>
&lt;p>I hope you enjoyed reading this blog post. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and stay tuned for more updates!&lt;/p></description></item><item><title>Midterm: Open Source Autonomous Vehicle Controller</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230801-25chilingh/</link><pubDate>Tue, 01 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230801-25chilingh/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc">Open Source Autonomous Vehicle Controller Project&lt;/a>, my &lt;a href="https://docs.google.com/document/d/1hDU87aAzbn88vWwOHH0ggIID2W4KKzp8SKF1Lb8LU90/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;strong>Aaron Hunter and Carlos Espinosa&lt;/strong>, aimed to create comprehensive technical documentation to help onboard new users of the OSAVC controller.&lt;/p>
&lt;p>I have accomplished the following:&lt;/p>
&lt;ul>
&lt;li>From the KiCad Schematic Editor, created pinouts of the I/O connectors on the OSAVC.&lt;/li>
&lt;li>Detailed a hardware overview of the OSAVC by labeling and describing each electrical component.&lt;/li>
&lt;li>Documented the setup for loading code on the OSAVC, including software such as Git, MPLAB X, XC32 Compiler, and serial terminal and hardware by showing how to connect the PICKit3 and OSAVC to a PC.&lt;/li>
&lt;li>Tested the OSAVC by receiving and transmitting characters in the serial port into a buffer.&lt;/li>
&lt;li>Fixed bugs/errors in the NEO_M8N GPS module library and PWM motors library.&lt;/li>
&lt;li>Created a new library for the uni and bidirectional ESC brushless motors.&lt;/li>
&lt;li>Created a user-interfaced test harness for all peripherals: serial, IMU, GPS, encoder, PWM actuators, radio telemetry, Mavlink heartbeat, radio controller, and LIDAR.&lt;/li>
&lt;li>Incorporated new user interface element and fixed video streaming errors in the Flask app running on the Raspberry Pi 4 communicating with the OSAVC.&lt;/li>
&lt;li>Documented both software and hardware steps to run the OSAVC with a companion computer such as a Raspberry Pi 4.&lt;/li>
&lt;li>Highlighted common problems encountered with the OSAVC.&lt;/li>
&lt;li>Created a contributor&amp;rsquo;s guide for others to create new libraries or contribute to the OSAVC project.&lt;/li>
&lt;li>Designed a &lt;a href="https://grabcad.com/library/ptn78020w-1" target="_blank" rel="noopener">switching voltage regulator&lt;/a> in SOLIDWORKS.&lt;/li>
&lt;li>Designed a self-balancing bot that employs the OSAVC in SOLIDWORKS.&lt;/li>
&lt;/ul>
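&lt;p>As an aside on the serial test in the list above: receiving and transmitting characters through a buffer is typically done with a ring (circular) buffer on microcontrollers. Here is a minimal sketch of that data structure, purely illustrative and not the OSAVC firmware&amp;rsquo;s actual C implementation:&lt;/p>

```python
class RingBuffer:
    """Fixed-size circular byte buffer, as commonly used for serial RX/TX."""

    def __init__(self, size=64):
        self.buf = bytearray(size)
        self.head = self.tail = self.count = 0

    def put(self, byte):
        if self.count == len(self.buf):
            raise OverflowError("buffer full")
        self.buf[self.head] = byte
        self.head = (self.head + 1) % len(self.buf)  # wrap around
        self.count += 1

    def get(self):
        if self.count == 0:
            raise IndexError("buffer empty")
        byte = self.buf[self.tail]
        self.tail = (self.tail + 1) % len(self.buf)
        self.count -= 1
        return byte

rb = RingBuffer(4)
for b in b"hi":
    rb.put(b)                                # "receive" two characters
echoed = bytes(rb.get() for _ in range(2))   # "transmit" them back
```

&lt;p>The head and tail indices wrap around the fixed-size array, so an interrupt handler can enqueue received bytes while the main loop dequeues and echoes them.&lt;/p>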
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;p>Currently, the laser cutter at UCSC is under maintenance, so we couldn&amp;rsquo;t assemble the self-balancing bot yet. Once we assemble it, I will finish and document the control algorithms. We can also try incorporating ML models on the Raspberry Pi with the Coral USB accelerator on the self-balancing bot.&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230801-shayantan/</link><pubDate>Tue, 01 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230801-shayantan/</guid><description>&lt;p>We are currently midway through the OSRE 2023 program, and the following post lists the progress that I have made on the project so far.
As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, our overall goal was to enumerate the effect of sequence data quality on execution times. Towards that end, we decided to first identify suitable datasets from the two commonly available -omics data modalities - transcriptomics and genomics. Albrecht et al. [1] developed &lt;em>&lt;strong>seqQscorer&lt;/strong>&lt;/em> to automate the quality control step of NGS data analysis through predictive modeling. They have also published the list of ENCODE datasets used for training the models. A quality label of 0 was assigned to released files and 1 to revoked files. Based on the guidelines set forth by ENCODE&amp;rsquo;s Data Coordination Centre (DCC), a comprehensive manual annotation of the data was done by scientists, and the resulting quality variable &amp;ldquo;status&amp;rdquo; was published to serve as an indication of the quality of the data. The following steps outline the process of generating the data table for building the machine learning models.&lt;/p>
&lt;ul>
&lt;li>Step 1: Programmatically accessed 86 (34 released; 34 revoked) RNA-seq files from the ENCODE database. All the FASTQ files were single-ended.&lt;/li>
&lt;li>Step 2: Programmatically accessed 288 (144 released; 144 revoked) DNA-seq files from the ENCODE database. All the FASTQ files were paired-end.&lt;/li>
&lt;li>Step 3: Implemented the STAR aligner for RNA-seq and the BWA aligner for DNA-seq. The resulting outputs contained the alignment times for both the &amp;ldquo;revoked&amp;rdquo; and &amp;ldquo;released&amp;rdquo; files.&lt;/li>
&lt;li>Step 4: Ran statistical tests to determine whether there are any significant differences in the runtimes of the two types of files.&lt;/li>
&lt;/ul>
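&lt;p>Step 4 can be sketched with a simple two-sample comparison. The snippet below computes Welch&amp;rsquo;s t statistic from scratch as one example of such a test; the alignment times are made up for illustration, and the actual analysis may use a different test:&lt;/p>

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic: compares two sample means without assuming
    equal variances (e.g. runtimes of released vs. revoked files)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / \
        math.sqrt(va / len(a) + vb / len(b))

released = [10.2, 11.0, 9.8, 10.5]  # invented alignment times (minutes)
revoked = [12.1, 13.4, 12.8, 11.9]
t = welch_t(released, revoked)      # strongly negative: released files are faster
```

&lt;p>A large-magnitude t statistic relative to the appropriate t distribution would indicate a significant runtime difference between the two groups.&lt;/p>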
&lt;p>Currently I am running the FastQC tool to extract data quality metrics for the same set of files as discussed above. Once I have collected those metrics, I can start building regression models to determine whether data quality has any significant impact on execution time. The first step in the execution of a typical genomic analysis workflow is quality control of the raw data - a crucial step in removing low-quality data instances that may significantly impact the downstream analysis. Through our analysis, we aim to develop a reproducible ML model that will give the user an estimate of the runtime based on the raw FASTQ file as input.&lt;/p>
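&lt;p>The planned regression can likewise be sketched in a few lines. Everything below is invented for illustration (a single quality metric with made-up values); the real models will be trained on the full set of quality metrics:&lt;/p>

```python
def ols(x, y):
    """Ordinary least squares for one predictor: y ~ slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return slope, my - slope * mx

# One quality metric vs. alignment runtime (values invented).
quality = [0.9, 0.8, 0.7, 0.6, 0.5]       # e.g. fraction of high-quality reads
runtime = [10.0, 11.5, 13.0, 14.5, 16.0]  # minutes

slope, intercept = ols(quality, runtime)
predicted = slope * 0.75 + intercept      # runtime estimate for a new file
```

&lt;p>With real FastQC metrics as predictors, a model of this shape (or a richer one) would give users the runtime estimate described above.&lt;/p>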
&lt;h2 id="references">References&lt;/h2>
&lt;p>[1] Albrecht, S., Sprang, M., Andrade-Navarro, M.A. &lt;em>et al.&lt;/em> seqQscorer: automated quality control of next-generation sequencing data using machine learning. &lt;em>Genome Biol&lt;/em> &lt;strong>22&lt;/strong>, 75 (2021). &lt;a href="https://doi.org/10.1186/s13059-021-02294-2" target="_blank" rel="noopener">https://doi.org/10.1186/s13059-021-02294-2&lt;/a>&lt;/p></description></item><item><title>[Mid-term] Capturing provenance into Data Science/Machine Learning workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230731-jesselima/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230731-jesselima/</guid><description>&lt;p>This post describes our midterm work status and some achievements we have done so far in &lt;a href="https://docs.google.com/document/d/1YMtPjZXcgt5eplyxIgQE8IBpQIiRlB9eqVSQiIPhXNU/edit#heading=h.nnxl1g16trg0" target="_blank" rel="noopener">the project&lt;/a> for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow/">noWorkflow&lt;/a> package.&lt;/p>
&lt;h4 id="the-initial-weeks">The initial weeks&lt;/h4>
&lt;p>I started doing a bibliographical review on reproducibility in the Data Science (DS) and Machine Learning (ML) realms. It was a new subject to me, and I aimed to build a more robust theoretical background in the field. Meanwhile, I took notes in &lt;a href="https://jaglima.github.io/" target="_blank" rel="noopener">this series of posts&lt;/a>.&lt;/p>
&lt;p>Then, as planned, I integrated with the current noWorkflow supporters in order to get a broader view of the project and their contributions. Additionally, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, and I set up a weekly one-hour schedule to keep track of my activities.&lt;/p>
&lt;h3 id="brainstormed-opportunities">Brainstormed opportunities&lt;/h3>
&lt;p>At the beginning of June, we also met with other project supporters to brainstorm about our initial proposal. From this meeting, we came up with a plan for how to technically approach a new noWorkflow feature for Data Science and Machine Learning experiment management.&lt;/p>
&lt;p>In this brainstorm, we agreed that &lt;em>Jupyter Notebooks are, by far, the most frequent setup in DS/ML computational experiments. They established themselves as the fundamental artifact by embedding code and text and enabling execution and visualization. Entire experiments are created and kept in Jupyter notebooks until they are sent to production. The opportunity at hand is to integrate noWorkflow with Jupyter Notebooks&lt;/em>.
Our mid-term goal was therefore adapted from the original plan of only selecting and executing a prototypical ML experiment: we added the goal of paving the way for a tagging feature for Notebook cells.&lt;/p>
&lt;p>More specifically, DS/ML experimental workflows usually have well-defined stages composed of &lt;em>data reading&lt;/em>, &lt;em>feature engineering&lt;/em>, &lt;em>model scoring&lt;/em>, and &lt;em>metrics evaluation&lt;/em>. In our dream space, the user would tag a cell in their experiment, enabling the capture of the tagged metadata into a database. This step integrates the ultimate goal of facilitating comparisons, management, and even causal inference across different trials of a DS/ML experiment.&lt;/p>
&lt;h3 id="current-deliverables">Current deliverables&lt;/h3>
&lt;p>So, based on our plans, we created a separate table to store the metadata from cell tagging. This table stores the cell hash codes and the information needed to match the code executed within a cell. As a result, we can store tags and the activation ids of the cells, enabling us to identify the cell containing a given stage of a DS/ML experiment.&lt;/p>
&lt;p>The second feature implemented was tagging a specific variable. As with cells, it is now possible to stamp a given variable with a tag, keeping its name, id, and received value in this separate table.&lt;/p>
&lt;p>Finally, we worked on displaying the dependencies of a given variable. In this case, by tagging a given variable, we can display the other variables, values, and cells activated in its construction. Then, we can visualize the dependencies that contributed to its final value.&lt;/p>
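&lt;p>To make the idea concrete, here is a sketch of what such a separate tag table could look like. The schema, column names, and values below are illustrative only, not noWorkflow&amp;rsquo;s actual schema:&lt;/p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tag (
        trial_id      INTEGER,  -- which execution of the experiment
        name          TEXT,     -- e.g. 'feature_engineering' or 'model_scoring'
        cell_hash     TEXT,     -- hash matching the code executed in the cell
        activation_id INTEGER,  -- activation id of the tagged cell
        variable_name TEXT,     -- filled when a variable, not a cell, is tagged
        value_repr    TEXT      -- repr of the tagged variable's value
    )
""")
conn.execute("INSERT INTO tag VALUES (?, ?, ?, ?, ?, ?)",
             (1, "model_scoring", "ab12cd", 42, "score", "0.93"))
rows = conn.execute(
    "SELECT name, variable_name, value_repr FROM tag WHERE trial_id = 1"
).fetchall()
```

&lt;p>Keyed by trial, a table of this shape is what makes cross-trial comparisons of tagged stages and variables straightforward.&lt;/p>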
&lt;p>For an overview of current developments, please refer to my &lt;a href="https://github.com/jaglima/noworkflow/tree/stage_tagging" target="_blank" rel="noopener">fork of the main project&lt;/a>.&lt;/p>
&lt;h3 id="challenges">Challenges&lt;/h3>
&lt;p>During this period, we had to make choices along the way. For instance, capturing the provenance of cells through tags is a different solution than tagging code chunks in scripts. In this case, we decided to stick with tagging Notebook cells at this moment. We also opted to start storing the metadata to enable comparisons between trials rather than focus on a sophisticated graphic and user-friendly cell tagging system. We also opted to keep this metadata info stored in a separate table in the database.&lt;/p>
&lt;h3 id="next-steps">Next steps&lt;/h3>
&lt;p>In the second half of the summer, our goal is to integrate these features in order to proceed with comparisons among experiments. Such comparisons would use the tagged variables as the hyperparameters of DS/ML experiments or key variables to assess the experiments, such as errors or scores. As a result, we will be able to compare the results of two trials in a more accurate, and easily reproducible experiment.&lt;/p></description></item><item><title>Implemented IO uring for Key-Value Drives</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/</guid><description>&lt;p>Hi everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Manank Patel (&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank">link&lt;/a> to my introduction post), and I am currently working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore">Efficient Communication with Key/Value Storage Devices&lt;/a>. The goal of the project was to leverage the capabilities of io_uring and implement a new backend driver.&lt;/p>
&lt;p>In the existing sockets backend, we use non-blocking sockets with looping to ensure all the data is written. Here is a simplified flow diagram for the
same. The reasoning behind using non-blocking sockets and TCP_NODELAY is to get proper network utilization. This snippet from the code explains it further.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">NODELAY means that segments are always sent as soon as possible,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">even if there is only a small amount of data. When not set,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">data is buffered until there is a sufficient amount to send out,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">thereby avoiding the frequent sending of small packets, which
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">results in poor utilization of the network. This option is
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">overridden by TCP_CORK; however, setting this option forces
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">an explicit flush of pending output, even if TCP_CORK is
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">currently set.
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sockets flow" srcset="
/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_fe3f3d8030752b92e5fb87ea1d67e0c2.webp 400w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_44c789c0dc2dbae770c40595d35ae941.webp 760w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_fe3f3d8030752b92e5fb87ea1d67e0c2.webp"
width="469"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>In the above figure, we have a &lt;a href="https://gitlab.com/kinetic-storage/libkinetic/-/blob/manank/src/ktli_socket.c?ref_type=heads#L436" target="_blank" rel="noopener">loop&lt;/a> with a writev call. We check the return value: if not all the data has been written, we adjust the
offsets and loop again; once all the data has been written, we exit the loop and return from the function. This works well with traditional sockets, as we get the return value from the writev call as soon as it returns. If we try to follow the same design with io_uring, we get the
following flow diagram.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="uring flow" srcset="
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_eaf262f65651ce613bf0a033f897afde.webp 400w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_bc898fc227145dff9464f87e8f66363f.webp 760w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_eaf262f65651ce613bf0a033f897afde.webp"
width="417"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Here, as you can see, there is considerable additional overhead if we want to check the return value before sending the
next writev, because we need to know how many bytes have been written so far to adjust the offsets and issue
the next request accordingly. Thus, in every iteration of the loop we need to get an SQE, prep it for writev,
submit it, and then wait for the CQE to get the return value of the writev call.&lt;/p>
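&lt;p>The partial-write retry loop from the sockets backend can be sketched in a few lines of Python. This is an illustration of the pattern, not the project&amp;rsquo;s C code from ktli_socket.c:&lt;/p>

```python
import socket

def send_all(sock, payload):
    """Keep writing from the current offset until the whole payload is sent,
    mirroring the writev loop: a partial write advances the offset and loops."""
    offset = 0
    while offset < len(payload):
        written = sock.send(payload[offset:])  # may accept fewer bytes than asked
        offset += written
    return offset

a, b = socket.socketpair()
sent = send_all(a, b"x" * 10_000)  # delivered in one or more writes
a.close()
received = b.recv(10_000)
b.close()
```

&lt;p>With plain sockets this return-check-retry cycle is cheap; with io_uring, each iteration would also pay the SQE/CQE round trip described above.&lt;/p>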
&lt;p>The alternate approach would be to write the full message/iovec atomically in one call, as shown in following diagram.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="possible uring flow" srcset="
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_df20a0788e55e56bf7af70d91c7275c6.webp 400w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_056949985d6ef71540ba0c4992f11376.webp 760w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_df20a0788e55e56bf7af70d91c7275c6.webp"
width="535"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>However, on trying this method and running fio tests, we noticed that it worked well with smaller block sizes, like
16k, 32k, and 64k, but failed consistently with larger block sizes like 512k or 1m. This was because it was not able to
write all the data to the socket in one go. This method showed good results compared to the sockets backend (for small block
sizes, that is). We tried increasing the send/recv buffers to 1MiB-10MiB, but it still struggled with larger block sizes.&lt;/p>
&lt;p>Going forward, we discussed a few ideas to understand the performance trade-offs. One is to use a static variable and increment it on
every loop iteration; in this way we can find out whether that is really the contributing factor to our problem. Another idea
is to break the message down into small chunks, say 256k, set up io_uring with SQE polling, and then link and submit
those requests in a loop, without calling io_uring_submit and waiting for a CQE after each one. The plan is to try these ideas, discuss, and
come up with new ideas on how we can leverage io_uring for ktli backend.&lt;/p></description></item><item><title>Improving Video Applications' Accuracy by Enabling The Use of Concierge</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/edgebench/20230731-zharfanf/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/edgebench/20230731-zharfanf/</guid><description>&lt;style>
p {
text-align: justify;
}
img {
display: block;
margin-left: auto;
margin-right: auto;
}
&lt;/style>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello, it&amp;rsquo;s me again, Faishal, a SoR project contributor for the edgebench project. For the past two months, my mentors and I have been working on improving the performance of our system. In this report, I would like to share what we have been working on.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Edgebench is a project that focuses on how to efficiently distribute resources (bandwidth and CPU usage) across several video applications. Nowadays, video applications process their data or video on a server, an approach known as edge computing. Bandwidth and compute may therefore be the greatest concerns when we talk about edge computing over a WAN, because they are strictly limited.&lt;/p>
&lt;p>Consider the following case: suppose we have 3 video applications running, located in several areas across a city, and suppose the total bandwidth allocated to those 3 video applications is fixed. Naively, we might divide the bandwidth evenly among the cameras in the system. We would then have the following graph of allocated bandwidth over time.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/baseline_alloc.png" alt="Baseline Allocation" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>These allocations are fixed and won&amp;rsquo;t change. However, every video application has its own characteristics that determine how well it can deliver a good result, i.e., f1-score. It is our task to maintain a high average f1-score; therefore, we need to implement a new solution which is accuracy-oriented. This is where the accuracy-gradient&lt;a href="#acc">[1]&lt;/a> comes in.&lt;/p>
&lt;h2 id="system-design">System Design&lt;/h2>
&lt;p>In our current design, we need a resource allocator, namely the concierge. The concierge determines how much bandwidth every video application (vap) in the system needs, performing the allocation at a predetermined time interval. This process is called profiling: the concierge first asks every vap to calculate its f1-score on a certain video segment when its bandwidth is increased by profile_delta. The difference between this f1-score and the default f1-score is stored as &lt;code>f1_diff_high&lt;/code>. The concierge then asks each vap to reduce its bandwidth by profile_delta and repeat the same process, producing &lt;code>f1_diff_low&lt;/code>. Both results are sent back to the concierge for the next step, the sensitivity calculation, where the sensitivity is&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://latex.codecogs.com/svg.image?&amp;amp;space;sensitivity[i]=f1%5c_diff%5c_high[i]-%5cSigma_%7bk=1%7d%5enf1%5c_diff%5c_low[k];k%5cneq&amp;amp;space;i&amp;amp;space;" alt="sensitivity[i] = f1_diff_high[i] - \Sigma_{k=1}^nf1_diff_low[k]; k \neq i" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>This equation tells us which video application will give us the best f1-score improvement if we add more bandwidth to one vap while reducing the others&amp;rsquo;. Based on this, the concierge will give bandwidth to the app with the highest sensitivity and take bandwidth from the app with the lowest sensitivity.&lt;/p>
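&lt;p>The profiling and reallocation step can be sketched as follows. The function and variable names are illustrative rather than the project&amp;rsquo;s actual API, and the numbers are invented:&lt;/p>

```python
def sensitivities(f1_diff_high, f1_diff_low):
    """sensitivity[i] = f1_diff_high[i] - sum of f1_diff_low[k] for k != i."""
    total_low = sum(f1_diff_low)
    return [high - (total_low - f1_diff_low[i])
            for i, high in enumerate(f1_diff_high)]

def reallocate(bandwidth, f1_diff_high, f1_diff_low, profile_delta=80):
    """Move profile_delta kbps from the least to the most sensitive app."""
    s = sensitivities(f1_diff_high, f1_diff_low)
    bandwidth[s.index(max(s))] += profile_delta
    bandwidth[s.index(min(s))] -= profile_delta
    return bandwidth

# Three apps at 400 kbps each; app 0 benefits most from extra bandwidth.
bw = reallocate([400, 400, 400], [0.10, 0.02, 0.01], [0.05, 0.01, 0.00])
```

&lt;p>Repeating this at every profiling interval gradually shifts bandwidth toward the apps where it improves the f1-score the most.&lt;/p>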
&lt;h2 id="results">Results&lt;/h2>
&lt;p>As mentioned, our main objective is to improve accuracy. However, two parameters will be taken into account: the improvement itself and the overhead of obtaining it. We first chose 3 dds apps&lt;a href="#dds">[2]&lt;/a> that we think represent our ideal case. The following graphs show the profile of this ideal case.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/ideal_case.png" alt="Ideal Case" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>We can see that two of them have high sensitivity, especially at lower bandwidths, and one of them has low sensitivity. This is a perfect scenario, since we may sacrifice one app&amp;rsquo;s bandwidth and give it to the app that has the highest sensitivity at that iteration. We will run the experiment under the following setup:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">DATASETS&lt;/span>&lt;span class="o">=(&lt;/span>&lt;span class="s2">&amp;#34;&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;uav-1&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;coldwater&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;roppongi&amp;#34;&lt;/span>&lt;span class="o">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">MAX_BW&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="m">1200&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">PROFILING_DELTA&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="m">80&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">MI&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="m">5&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This setup means we use a total bandwidth of 1200 kbps, so initially the bandwidth is distributed evenly (400 kbps per app). The profiling delta (&lt;code>PROFILING_DELTA&lt;/code>) is 80 kbps and the profiling interval (&lt;code>MI&lt;/code>) is 5 seconds.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/merged_ideal.png" alt="Merged Ideal" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">&lt;strong>Mode&lt;/strong>&lt;/th>
&lt;th style="text-align:center">&lt;em>DDS&lt;/em> &lt;br> (&lt;span style="color:blue">&lt;em>uav-1&lt;/em>&lt;/span>)&lt;/th>
&lt;th style="text-align:center">&lt;em>DDS&lt;/em> &lt;br> (&lt;span style="color:orange">&lt;em>coldwater&lt;/em>&lt;/span>)&lt;/th>
&lt;th style="text-align:center">&lt;em>DDS&lt;/em> &lt;br> (&lt;span style="color:green">&lt;em>roppongi&lt;/em>&lt;/span>)&lt;/th>
&lt;th style="text-align:center">Average&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">Baseline&lt;/td>
&lt;td style="text-align:center">0.042&lt;/td>
&lt;td style="text-align:center">0.913&lt;/td>
&lt;td style="text-align:center">0.551&lt;/td>
&lt;td style="text-align:center">0.502&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Concierge&lt;/strong>&lt;/td>
&lt;td style="text-align:center">0.542&lt;/td>
&lt;td style="text-align:center">0.854&lt;/td>
&lt;td style="text-align:center">0.495&lt;/td>
&lt;td style="text-align:center">&lt;strong>0.63&lt;/strong> (&lt;span style="color:green">&lt;em>+25.5%&lt;/em>&lt;/span>)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>From the result, we managed to improve the average f1-score by &lt;strong>0.1&lt;/strong>, or &lt;strong>25.5%&lt;/strong>, which is a very good result. There are a total of 10 videos in our dataset; for the next experiment, we first generate 6 combinations of DDS apps. Note that each combination includes uav-1, since we know it has the highest sensitivity. We run the experiment with 4 bandwidth scenarios &lt;strong>(1200, 1500, 1800, 2100)&lt;/strong> in kbps.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/only_uav_merged.png" alt="Only Uav-1" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The left figure depicts the average improvement from the concierge. We can see that the improvement decreases as the total bandwidth increases. The reason is that at higher bandwidths the sensitivity tends toward 0, so the concierge does not perform any reallocation. Overall, this confirms our previous result that, with the help of uav-1, the concierge can improve the f1-score by up to 0.1. In the next experiment we randomly pick 3 DDS videos out of the 10, repeating the draw 10 times, to see how the concierge performs without any help from uav-1.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/random_merged.png" alt="Random Merged" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>From the result, we still obtain an improvement. However, the average improvement decreases compared to the previous experiment. The reason for this phenomenon is discussed later.&lt;/p>
&lt;h3 id="overhead-measurement">Overhead Measurement&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/overhead_1.png" alt="Overhead Measurement" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Each graph above represents a different total bandwidth. In this experiment, a lower &lt;code>MI&lt;/code> clearly leads to higher overhead, since more profiling runs are performed than with a higher &lt;code>MI&lt;/code>. The four graphs show that lowering the &lt;code>MI&lt;/code> incurs a significant trade-off, since the resulting improvement is not highly significant. The highest improvement is at &lt;strong>1200 kbps&lt;/strong>; hence, for higher bandwidths, there is no need to profile too often.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>There are some limitations in our current design. If we look at the box plot in figure 5 above, we can see that there are some combinations where the improvement is negative.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/recovery_failed.png" alt="Failed Recovery" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The figure above depicts the profiling process on segment 6, used to determine the bandwidth for segment 7. We can see that the f1-score at that bandwidth for (&lt;span style="color:blue">&lt;em>jakarta&lt;/em>&lt;/span>) drops significantly. Our current design cannot address this issue yet, since we only consider the current video segment. We need to take not only the current segment into account, but also the previous and future segments.&lt;/p>
&lt;p>Regarding the overhead, we are aware that a 50% overhead is still considered bad. We may try a dynamic &lt;code>MI&lt;/code>, or skip profiling for certain videos when it is not necessary.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Despite the aforementioned limitations, this report shows that the concierge is generally capable of delivering an f1-score improvement. Further updates will appear in the final report.&lt;/p>
&lt;h2 id="references">References&lt;/h2>
&lt;p>&lt;a id="acc">[1]&lt;/a> &lt;a href="https://drive.google.com/file/d/1U_o0IwYcBNF98cb5K_h56Nl-bQJSAtMj/view?usp=sharing" target="_blank" rel="noopener">https://drive.google.com/file/d/1U_o0IwYcBNF98cb5K_h56Nl-bQJSAtMj/view?usp=sharing&lt;/a> &lt;br>
&lt;a id="dds">[2]&lt;/a> Kuntai Du, Ahsan Pervaiz, Xin Yuan, Aakanksha Chowdhery, Qizheng Zhang, Henry Hoffmann, and Junchen Jiang. 2020. Server-driven video streaming for deep learning inference. In Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication. 557–570.&lt;/p></description></item><item><title>Mid-term blog post for Public Artifact Data and Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230731-zjyhhhhh/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230731-zjyhhhhh/</guid><description>&lt;p>Over the past few weeks, our platform development has been progressing steadily, and we are excited to share the milestones we have achieved so far. As planned in our &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230617-zjyhhhhh">introductory blog&lt;/a>, we have successfully laid the groundwork for the platform with the guidance and support of our mentor.&lt;/p>
&lt;h2 id="milestones-and-accomplishments">Milestones and Accomplishments&lt;/h2>
&lt;p>Here are some of the key functionalities we have implemented so far:&lt;/p>
&lt;ol>
&lt;li>Modular Architecture: We successfully designed the platform with a modular architecture, separating the Graphical User Interface (GUI) and Command-Line Interface (CLI) functionalities. This modularity allows users to interact with the platform in their preferred way.&lt;/li>
&lt;li>Experiment and Bucket Creation: Users can now create experiments, buckets (for storing different implementations of experiments), and iterations using either the GUI or CLI.&lt;/li>
&lt;li>Real-time Backend Environment Monitoring: Through the command line interface, users have the capability to control the monitoring of backend environment data, allowing for real-time tracking and analysis of important metrics.&lt;/li>
&lt;li>Visualizing Environment Variables: Users can now visualize detected environment variables on the platform. Moreover, they can compare iterations within different buckets and gain more insights by observing the timeseries data, such as CPU usage, in a graphical format.&lt;/li>
&lt;/ol>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>In the early stages of designing our platform, we encountered significant challenges at the system design level. One of the most daunting obstacles we faced was devising an effective method to monitor backend environment variables. To tackle this obstacle, we engaged in extensive discussions and sought guidance from our mentor. After careful consideration, we decided to adopt a multi-process approach to monitor the backend environment variables effectively. Specifically, we devised a meticulous strategy of creating a separate process in the background for each specific metric we needed to monitor. By allocating a dedicated process to each metric, we ensured a streamlined and efficient monitoring process.&lt;/p>
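&lt;p>The one-process-per-metric idea described above can be sketched with Python&amp;rsquo;s &lt;code>multiprocessing&lt;/code> module. This is our own illustrative sketch, not the project&amp;rsquo;s actual code; the function names, the shared queue, and the sampling interval are assumptions:&lt;/p>

```python
import multiprocessing as mp
import time

def monitor(metric_name, sample_fn, interval, out_queue, stop_event, max_samples=None):
    # Runs inside its own process: sample one metric and push readings
    # onto a shared queue until asked to stop (or max_samples is reached).
    taken = 0
    while not stop_event.is_set():
        out_queue.put((metric_name, time.time(), sample_fn()))
        taken += 1
        if max_samples is not None and taken >= max_samples:
            break
        time.sleep(interval)

def start_monitors(metrics, interval=1.0):
    # metrics: dict mapping a metric name to a zero-argument sampling function.
    # One dedicated background process is created per metric.
    out_queue, stop_event = mp.Queue(), mp.Event()
    procs = [mp.Process(target=monitor, args=(name, fn, interval, out_queue, stop_event))
             for name, fn in metrics.items()]
    for p in procs:
        p.start()
    return out_queue, stop_event, procs
```

&lt;p>The main process then drains the queue to record samples, and sets the stop event to shut all monitors down cleanly.&lt;/p>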
&lt;p>Currently, we are facing a challenge related to monitoring metrics. Since different users have varying monitoring requirements, it is impractical for us to manually write monitoring solutions for each user. To address this issue, we are actively working on implementing a pluggable design that allows users to configure their own monitoring preferences.&lt;/p>
&lt;p>Our approach involves providing users with the flexibility to define their custom configuration files or write monitoring programs following our documented guidelines. This way, users can specify the specific metrics they wish to monitor and tailor the monitoring process to their individual needs.&lt;/p>
&lt;h2 id="try-it-out">Try it Out!&lt;/h2>
&lt;p>As mentioned earlier, we have completed the core functionalities of our platform, and we would love to have you try it out and provide us with valuable feedback. Here are the links to our repositories where you can explore and experiment with our platform:&lt;/p>
&lt;ol>
&lt;li>&lt;a href="https://github.com/PublicExperimentDatabase/PublicExperimentGUI" target="_blank" rel="noopener">GUI Repository&lt;/a> and &lt;a href="https://github.com/PublicExperimentDatabase/PublicExperimentCLI" target="_blank" rel="noopener">CLI Repository&lt;/a>
&lt;ul>
&lt;li>In the README.md file of GUI repo, you will find detailed installation instructions to set up the Graphical User Interface (GUI). Follow the steps provided to get started with our platform.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://github.com/PublicExperimentDatabase/test-experiment" target="_blank" rel="noopener">Sample Repository&lt;/a>
&lt;ul>
&lt;li>In this repository, we have included scripts that allow you to run our program. Additionally, you can use these scripts as templates to monitor your own programs according to your specific requirements.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>We welcome you to take the platform for a test drive and feel free to raise any issues you encounter during the installation process. Your feedback is invaluable to us, as it helps us identify and address any potential installation challenges and improve the user experience.&lt;/p></description></item><item><title>Enhancing Drift Detection through Fine-Tuning Llama2</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230730-kangrui/</link><pubDate>Sun, 30 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230730-kangrui/</guid><description>&lt;p>Greetings everyone, I&amp;rsquo;m Kangrui. Over the past few weeks, we&amp;rsquo;ve dedicated our efforts and have consequently made significant progress in our drift detection methods. Now, I&amp;rsquo;m excited to present to you a detailed elaboration on how we prompted and fine-tuned Llama2 to efficiently carry out the drift detection task.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;h3 id="why-llm-in-drift-detection-method">Why LLM in drift detection method?&lt;/h3>
&lt;p>The use of large language models (LLMs) in drift detection methods presents numerous benefits that place it as a prominent solution in this domain.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Rapid Development:&lt;/strong> LLMs are in the vanguard of technological advancement. This field is evolving rapidly with continuous enhancements in model architecture, training techniques, and data handling. With every new version, these models are showing an increasing capacity to understand and generate human-like text, pushing the limits of what is achievable in Natural Language Processing (NLP) and Artificial Intelligence (AI) as a whole.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Superior Performance:&lt;/strong> Traditional drift detection methodologies such as Page-Hinkley, EDDM, and HDDM have their merits and have found success in numerous scenarios. Even Deep Learning (DL) techniques, like training a predictive model based on error rates, have made significant strides in the field. However, when handling complex, high-dimensional, and real-time data, LLMs have demonstrated exceptional results. They are not only able to effectively predict and respond to drifts but also adapt to new trends more swiftly. Our experiments using LLMs like GPT-3.5-turbo have yielded impressive results, notably outperforming other methods.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="GPT-3.5-turbo Performance" srcset="
/report/osre23/anl/perfdrift/20230730-kangrui/gpt-3.5-performance_hudb1929583c62f83e6182026371c0950a_147441_986c57531b096aac2ea5604c7942efed.webp 400w,
/report/osre23/anl/perfdrift/20230730-kangrui/gpt-3.5-performance_hudb1929583c62f83e6182026371c0950a_147441_534b4ca0b9e767d820ed9b45d754db9f.webp 760w,
/report/osre23/anl/perfdrift/20230730-kangrui/gpt-3.5-performance_hudb1929583c62f83e6182026371c0950a_147441_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230730-kangrui/gpt-3.5-performance_hudb1929583c62f83e6182026371c0950a_147441_986c57531b096aac2ea5604c7942efed.webp"
width="760"
height="303"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;em>Fig. 1: Concept drifts detected by GPT-3.5-turbo in the Cori dataset&lt;/em>&lt;/p>
&lt;ol start="3">
&lt;li>&lt;strong>Flexibility:&lt;/strong> One of the major advantages of using LLMs is their flexibility in dealing with different types of input and output. In contrast to traditional methods, which are confined to single feature concept drift detection and can only process numerical values, LLMs can handle a range of input types including text, numbers, and more complex data structures. This capability allows them to detect multi-feature concept drifts, thereby broadening the scope and complexity of problems they can tackle. Moreover, the generation capability of LLMs can provide rich and detailed output, facilitating more comprehensive insights into the detected drifts.&lt;/li>
&lt;/ol>
&lt;h2 id="why-llama2-in-drift-detection-method">Why Llama2 in drift detection method?&lt;/h2>
&lt;p>Llama2 presents a series of advantages that make it an excellent choice for applying LLMs to drift detection. Here&amp;rsquo;s a breakdown of the key reasons:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Performance Guarantee:&lt;/strong> As a newly released model, Llama2 has undergone extensive development and testing, providing a reliable guarantee of performance. It represents the cutting edge in AI technology, having benefited from the latest research and advancements in language model design.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Accessibility Guarantee:&lt;/strong> One significant advantage of Llama2 is that it is open-source. It is readily accessible on HuggingFace, which also provides a range of mature tools to fine-tune and deploy the model.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Flexibility for Fine-Tuning:&lt;/strong> Llama2 comes in different sizes, such as 7B, 13B, and 70B parameters, which allows for flexibility in model selection based on the task&amp;rsquo;s requirements and computational resources.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="data">Data&lt;/h2>
&lt;h3 id="dataset">Dataset&lt;/h3>
&lt;p>In our study, we employed &lt;a href="https://github.com/alipsgh/data-streams" target="_blank" rel="noopener">Synthetic data streams&lt;/a> for the fine-tuning of Llama2. Synthetic data streams serve as an invaluable resource for controlled experiments in the domain of drift detection. These curated datasets encompass varied types of drifts, providing us with the capability to assess the efficacy of our detection algorithms under diverse scenarios.&lt;/p>
&lt;p>Here is a brief introduction to the synthetic datasets we used:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sine1 &amp;amp; Sine2:&lt;/strong> These datasets induce abrupt concept drift within a two-dimensional feature space. The classification rule, a sine function, dictates the instance labels, which are flipped at every drift point.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Mixed:&lt;/strong> This dataset, characterized by its combination of numeric and boolean features, uses a composite classification rule. The abrupt concept drift is simulated via a periodic reversal of class labels.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Stagger:&lt;/strong> This categorical dataset incorporates abrupt concept drift by periodically altering the classification rules tied to the features.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Circles &amp;amp; LED:&lt;/strong> These datasets are designed to simulate gradual concept drift. In Circles, the classification of instances is determined by their spatial relation to specific circles. LED imitates a seven-segment digit display, introducing drift by interchanging the pertinent attributes.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Typically, the synthetic datasets contain 100,000 or 1,000,000 instances. Concept drift happens every 25,000 or 33,333 instances, and each drift is either abrupt (with a drifting period of 50 instances) or gradual (with a drifting period of 500 instances).&lt;/p>
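&lt;p>This drift schedule can be expressed as a small helper. The function name, and the assumption that a drifting period begins exactly at each multiple of the drift interval, are ours:&lt;/p>

```python
def in_drifting_period(index, drift_every=25000, width=50):
    # A drift starts at instance drift_every, 2 * drift_every, ...;
    # the following `width` instances form the drifting period.
    return index >= drift_every and width > index % drift_every
```

&lt;p>With the defaults (abrupt drift every 25,000 instances), instances 25,000 through 25,049 fall in the first drifting period.&lt;/p>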
&lt;h3 id="data-preprocessing-and-metrics">Data Preprocessing and Metrics&lt;/h3>
&lt;p>Given the token limit of Llama2 and the specific requirements of our project, we needed to transform the data into an appropriate format.&lt;/p>
&lt;p>As such, we processed each data stream into three sections: the &amp;lsquo;undrifted&amp;rsquo; period, the &amp;lsquo;drifting&amp;rsquo; period, and the &amp;lsquo;drifted&amp;rsquo; period. All instances in each section were randomly and independently drawn from the original data stream, summing up to a maximum of 100 instances. The number of instances for the undrifted and drifted periods ranged from 20 to 50, and for the drifting period, it ranged from 10 to 20.&lt;/p>
&lt;p>For instance, let&amp;rsquo;s consider a dataset containing 100,000 instances where abrupt concept drift occurs every 25,000 instances. To format a data point, we could draw 20 to 50 instances from the first 25,000 as the undrifted period. Then, we could draw 10 to 20 instances from the 25,001st to 25,050th instance as the drifting period. Finally, we would draw 10 to min(100 - num(undrifted period) - num(drifting period), 50) instances from the 25,051st to 50,050th instance as the drifted period. This newly formatted data stream is then fed into Llama2.&lt;/p>
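&lt;p>The sampling procedure above can be sketched as follows. This is an illustrative reconstruction under the stated ranges; the function name and the exact endpoint of the drifted window are our assumptions:&lt;/p>

```python
import random

def make_data_point(stream, drift_at=25000, drift_width=50):
    # stream: list of (features, label) pairs with a drift starting at drift_at.
    # Draw instances for the undrifted, drifting, and drifted periods,
    # capping the formatted stream at 100 instances in total.
    n_before = random.randint(20, 50)
    n_during = random.randint(10, 20)
    n_after = random.randint(10, min(100 - n_before - n_during, 50))
    before = sorted(random.sample(range(0, drift_at), n_before))
    during = sorted(random.sample(range(drift_at, drift_at + drift_width), n_during))
    after = sorted(random.sample(range(drift_at + drift_width, 2 * drift_at + drift_width), n_after))
    indices = before + during + after
    return {
        "before_period": [0, n_before - 1],
        "transition_period": [n_before, n_before + n_during - 1],
        "after_period": [n_before + n_during, len(indices) - 1],
        "before_index": [before[0], before[-1]],
        "transition_index": [during[0], during[-1]],
        "after_index": [after[0], after[-1]],
        "data_stream": [stream[i] for i in indices],
    }
```

&lt;p>The returned dictionary mirrors the data-point layout shown below: the &lt;code>*_period&lt;/code> keys index into the formatted stream, while the &lt;code>*_index&lt;/code> keys record positions in the original stream.&lt;/p>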
&lt;p>We also included some additional information to assist Llama2&amp;rsquo;s inference process. A typical data point in our processed dataset includes:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;before_period&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">31&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;transition_period&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">32&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">38&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;after_period&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">39&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">59&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;before_index&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">196&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">19963&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;transition_index&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">20002&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">20030&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;after_index&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">20310&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">39984&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;meta&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Dataset: MIXED&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">v&amp;#39;s type is nominal, range is (&amp;#39;False&amp;#39;, &amp;#39;True&amp;#39;)&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">w&amp;#39;s type is nominal, range is (&amp;#39;False&amp;#39;, &amp;#39;True&amp;#39;)&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">x&amp;#39;s type is numeric&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">y&amp;#39;s type is numeric&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">class&amp;#39;s type is nominal, range is (&amp;#39;p&amp;#39;, &amp;#39;n&amp;#39;)&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;data_stream&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="o">...&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>From this dictionary, the &amp;ldquo;meta&amp;rdquo; and &amp;ldquo;data_stream&amp;rdquo; entries are fed into Llama2. The &amp;ldquo;transition_period&amp;rdquo; serves as the criterion: if Llama2&amp;rsquo;s answer lies within the &amp;ldquo;transition_period&amp;rdquo;, we deem it correct.&lt;/p>
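&lt;p>This correctness criterion can be written directly; the function name is illustrative, and we assume both endpoints of the transition period are inclusive:&lt;/p>

```python
def is_correct(predicted_index, data_point):
    # The model answers with an index into the formatted stream; the answer
    # counts as correct when it lies within the labelled transition period.
    lo, hi = data_point["transition_period"]
    return predicted_index in range(lo, hi + 1)
```
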
&lt;h2 id="llama2">Llama2&lt;/h2>
&lt;h3 id="inference">Inference&lt;/h3>
&lt;p>We experimented with three variations of prompts during the inference phase.&lt;/p>
&lt;p>&lt;strong>Prompt Version 1:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">[INST] &amp;lt;&amp;lt;SYS&amp;gt;&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> You are a helpful, respectful, and honest assistant. Always provide the most helpful responses possible while ensuring safety. Ensure that your responses are socially unbiased, positive, and free from harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. If a question lacks coherence or sense, explain why instead of providing incorrect information. If you are uncertain about an answer, refrain from sharing false information.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;&amp;lt;/SYS&amp;gt;&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Your task is to identify the index in a given data stream where the relationship between the features and labels begins to change. The data stream is formatted as a list, with each element being a two-element list: the first represents the features (also a list), and the second is the label. If your answer is &amp;#39;x&amp;#39;, it indicates that the data pattern starts shifting at the xth data point in the stream.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Here&amp;#39;s an example of the data&amp;#39;s metadata: Dataset: SINE1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> x&amp;#39;s type is numeric
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> y&amp;#39;s type is numeric
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> class&amp;#39;s type is nominal, range is (&amp;#39;p&amp;#39;, &amp;#39;n&amp;#39;)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> The given data stream is: [[[0.7, 0.07], &amp;#39;p&amp;#39;], [[0.45, 0.78], &amp;#39;n&amp;#39;], ..., [[0.64, 0.45], &amp;#39;n&amp;#39;]]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Your task is to respond with a single index. No additional information is required.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">[/INST]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Prompt Version 2:&lt;/strong>&lt;/p>
&lt;p>The same as Prompt 1, but with a specific range for the index response:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">Please provide an index ranging from 0 to 96. No additional information is required.
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Prompt Version 3:&lt;/strong>&lt;/p>
&lt;p>This prompt uses an instruction-input-output design, which we adopted for fine-tuning:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">Below is an instruction paired with an input that provides further context. Write a response that appropriately completes the request.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">### Instruction:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Identify the index in a given data stream where the relationship between features and labels begins to change. The data stream is formatted as a list, each element being a two-element list: the first represents the features (also a list), and the second is the label. For instance, if the response is &amp;#39;x&amp;#39;, it means that the data pattern starts shifting at the xth data point in the stream. Only respond with an index, no further information is necessary.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">### Input:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Meta Data:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Dataset: SINE1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> x&amp;#39;s type is numeric
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> y&amp;#39;s type is numeric
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> class&amp;#39;s type is nominal, range is (&amp;#39;p&amp;#39;, &amp;#39;n&amp;#39;)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Data stream:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">[[[0.7, 0.07], &amp;#39;p&amp;#39;], [[0.45, 0.78], &amp;#39;n&amp;#39;], .., [[0.64, 0.45], &amp;#39;n&amp;#39;]]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">### Response:
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Despite minor differences between Prompt Version 1 and Version 2, both suggested by Meta, the results varied significantly, a topic we will delve into in the following section. Prompt Version 3, employing the instruction-input-output structure, was used during our fine-tuning process.&lt;/p>
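&lt;p>Assembling Prompt Version 3 from a data point&amp;rsquo;s metadata and its formatted stream is mechanical; the sketch below follows the template above, with the function name and the use of Python&amp;rsquo;s default list formatting as our assumptions:&lt;/p>

```python
def build_prompt_v3(meta, data_stream):
    # Assemble the instruction-input-output prompt (Prompt Version 3)
    # from a data point's metadata string and its formatted stream.
    instruction = (
        "Identify the index in a given data stream where the relationship "
        "between features and labels begins to change. The data stream is "
        "formatted as a list, each element being a two-element list: the "
        "first represents the features (also a list), and the second is the "
        "label. For instance, if the response is 'x', it means that the data "
        "pattern starts shifting at the xth data point in the stream. Only "
        "respond with an index, no further information is necessary."
    )
    return (
        "Below is an instruction paired with an input that provides further "
        "context. Write a response that appropriately completes the request.\n"
        "### Instruction:\n" + instruction + "\n\n"
        "### Input:\nMeta Data:\n" + meta + "\n"
        "Data stream:\n" + str(data_stream) + "\n\n"
        "### Response:\n"
    )
```

&lt;p>The model&amp;rsquo;s completion is then read from whatever follows the final &lt;code>### Response:&lt;/code> marker.&lt;/p>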
&lt;h3 id="fine-tuning">Fine-Tuning&lt;/h3>
&lt;p>We utilized the tools provided by &lt;a href="https://github.com/facebookresearch/llama-recipes" target="_blank" rel="noopener">llama-recipes&lt;/a> to fine-tune Llama2. The key command used to initiate the fine-tuning process is illustrated below:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">python llama_finetuning.py --use_peft &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --peft_method lora &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --quantization &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --model_name meta-llama/Llama-2-13b-chat-hf &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --output_dir ./fine_tuned_model/Llama-2-13b-chat-hf-test_finetune &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --dataset alpaca_dataset &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --batch_size_training &lt;span class="m">40&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --num_epochs &lt;span class="m">1&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Some explanation of the parameters:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">--use_peft: This flag indicates the use of the Parameter-Efficient Fine-Tuning (PEFT) method. PEFT allows us to fine-tune the model more efficiently.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--peft_method lora: Here, we specify that the Lora (Layer-wise Optimal Brain Surgeon with Relevance-based Adjustment) method should be used for PEFT.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--quantization: The quantization flag is used to reduce the memory footprint of the model during the inference stage. It does so by reducing the precision of the model&amp;#39;s weights.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--dataset alpaca_dataset: Specifies the dataset setting used for fine-tuning, in this case, the &amp;#39;alpaca_dataset&amp;#39; indicates the instruction-input-output structure for fine-tuning.
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="results">Results&lt;/h2>
&lt;p>The performance of various models and prompt versions is depicted in Fig. 2.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="All Performance" srcset="
/report/osre23/anl/perfdrift/20230730-kangrui/performance_plot_hu026976f577cb17db71cb82cd3675225d_101027_f4b54b1d163428a3bbdd2373c5e7d6c6.webp 400w,
/report/osre23/anl/perfdrift/20230730-kangrui/performance_plot_hu026976f577cb17db71cb82cd3675225d_101027_ba09d14d8674a9735bf9bb60ce301dae.webp 760w,
/report/osre23/anl/perfdrift/20230730-kangrui/performance_plot_hu026976f577cb17db71cb82cd3675225d_101027_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230730-kangrui/performance_plot_hu026976f577cb17db71cb82cd3675225d_101027_f4b54b1d163428a3bbdd2373c5e7d6c6.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;em>Fig. 2: Performance comparison of different models and prompt versions.&lt;/em>&lt;/p>
&lt;p>It is evident from the results that the design of the prompt has a significant impact on Llama2&amp;rsquo;s performance. Furthermore, due to computational resource constraints, we have only managed to fine-tune Llama2 on a portion of our dataset (approximately 1,000 instances). The entire training set consists of 19,000 instances, and the test set includes 5,000 instances. Despite these limitations, a performance increase is noticeable after fine-tuning.&lt;/p></description></item><item><title>GPU Emulator for Easy Reproducibility of DNN Training -- Interim Blog Post</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/</link><pubDate>Sun, 30 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h4 id="motivation">Motivation&lt;/h4>
&lt;p>The growing popularity of Deep Neural Networks has resulted in a substantial increase in demand for Graphics Processing Units (GPUs). GPUs are crucial for conducting matrix computations in DNN training and inference. However, they are expensive to purchase for personal use, and the limited availability of GPU resources in public research clouds like Chameleon further exacerbates the issue. This scarcity of resources can cause delays in DNN-related research projects. Therefore, building an emulator can ameliorate the trouble of reserving GPUs, and the emulator can be modified to gather the profiles needed for optimization much quicker.&lt;/p>
&lt;h4 id="overture">Overture&lt;/h4>
&lt;p>The following sections introduce the completed tasks and the details of each. The contents are briefly summarized to present only the necessary information. We finished the following tasks:&lt;/p>
&lt;ul>
&lt;li>Literature Review&lt;/li>
&lt;li>Emulator implementation:
&lt;ul>
&lt;li>Time Profiling&lt;/li>
&lt;li>Pinned Memory&lt;/li>
&lt;li>Inter-GPU Computation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Reproducing Figures&lt;/li>
&lt;/ul>
&lt;p>I will introduce each of them and explain its importance.&lt;/p>
&lt;h2 id="tasks--reason">Tasks + Reason&lt;/h2>
&lt;h4 id="literature-review">Literature Review&lt;/h4>
&lt;p>While waiting for the measurements, I started reading other GPU-related papers, especially those about GPU schedulers. We found that besides emulating computation and transfer time, we should also emulate the GPU memory profile in order to reproduce some other papers. Fortunately, it’s doable. In fact, without actually using a GPU, we can emulate many aspects of the GPU beyond just its timing. I found several papers that are theoretically reproducible, but they use TensorFlow while my current work targets PyTorch, so I need to keep looking for ones that use PyTorch.&lt;/p>
&lt;p>Afterwards, we continued the paper review, looking over papers about GPU scheduling from 2018-2023 to see whether we could reproduce their figures. We went through 150 papers in search of ones with a PyTorch implementation and an accompanying GitHub page. We managed to find about 15 papers built in PyTorch, and 6 of them were published on GitHub.&lt;/p>
&lt;p>We found the paper &amp;ldquo;CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs&amp;rdquo; and its GitHub page. The paper has three badges: &amp;ldquo;Artifacts Available, Evaluated, and Reproduced.&amp;rdquo; Its content is implemented in PyTorch, which means we can probably emulate this paper&amp;rsquo;s results with the emulator we already have by adding more features. We started testing whether we can set up a similar environment and reproduce the experiments in the paper. After checking the paper&amp;rsquo;s reproducibility, we will try to reproduce it using our emulator, and we might add new features to the emulator during this process.&lt;/p>
&lt;p>Firstly, I tried to reproduce the figures in the paper “CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs”, but stopped after a considerable number of attempts because the README was incomplete and too hard to follow. I first headed to the paper’s GitHub. From reading the paper I understood that GNN training is not the same as regular deep learning training, because it has input irregularity, and CoGNN’s algorithm helps schedule the jobs onto the machines more effectively. However, when I tried to install the software following their environment README in order to reproduce the figures, I ran into many dependency issues, and barely any of the required packages installed successfully. Their README in the software module was also unclear on how to run the experiments, and following the experiment setup did not give me the expected results. After struggling to complete even one suggested experiment, we eventually abandoned this paper and moved on to others, which once again reminded me of the importance of reproducibility.&lt;/p>
&lt;p>Secondly, we found another paper, “Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent”. After reading it, we figured that its main focus is distributing node resources (CPU, GPU) to the jobs dispatched by the Kubernetes scheduler, so that there is less GPU fragmentation and a higher utilization rate of the resources. The paper uses a simulator to simulate a large number of nodes and run the jobs. I successfully ran the experiments demonstrated in the repo and even created a smaller sample so that we could get results faster, because their original experiment consists of 1020 runs, which would take about a month. However, when we dug deeper into the paper, we soon realized that their emulator is not a “real” one. Although it is built off Kubernetes, the component they used to create the figures is a mere simulator, and therefore it does not fit our goal of emulating only the GPU-related parts while running the other parts on a real system.&lt;/p>
&lt;h5 id="reason">Reason:&lt;/h5>
&lt;p>The purpose is to figure out which papers can be reproduced using the emulator, and what other features are needed for the emulator to work.&lt;/p>
&lt;h4 id="emulator-implementation">Emulator implementation&lt;/h4>
&lt;h5 id="time-profiling">Time Profiling&lt;/h5>
&lt;p>I did performance profiling of different GPUs, which included CPU-to-GPU data transfer time and GPU computation time. These two quantities are fairly constant on a given GPU, so they can be profiled once and then replayed during emulation. We did this for 6 different GPUs, including k80, rtx6000, m40, a100pcie, v100, and p100.&lt;/p>
&lt;p>After gathering the performance profiles of a few types of GPU nodes, I implemented the first naive version of the emulator. I used the recorded profile and the sleep() function to represent the amount of time each step needs to complete. The time also varies with the command given, so some simple arithmetic was implemented as well. The emulator runs on a CPU node, yet it reports a GPU&amp;rsquo;s time profile just as a real GPU node would.&lt;/p>
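&lt;p>A minimal sketch of this naive emulator, assuming illustrative profiled values (the real profiles are recorded per GPU and per command):&lt;/p>

```python
import time

# Hypothetical profiled timings (seconds) for one training step, keyed by
# (gpu_model, batch_size). The real emulator would load these from the
# recorded profiles rather than hard-code them.
PROFILE = {
    ("p100", 128): {"h2d_copy": 0.012, "compute": 0.045},
    ("v100", 128): {"h2d_copy": 0.010, "compute": 0.021},
}

def emulated_step(gpu_model, batch_size):
    """Sleep for the profiled transfer + compute time instead of using a GPU."""
    entry = PROFILE[(gpu_model, batch_size)]
    start = time.perf_counter()
    time.sleep(entry["h2d_copy"])   # emulate CPU-to-GPU data transfer
    time.sleep(entry["compute"])    # emulate GPU computation
    return time.perf_counter() - start

elapsed = emulated_step("p100", 128)
print(f"emulated p100 step: {elapsed:.3f}s")
```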
&lt;h5 id="reason-1">Reason:&lt;/h5>
&lt;p>The time profile collected can be compared with Data Wait Time to conduct research on minimizing pipeline stall across different GPUs and models.&lt;/p>
&lt;h5 id="pinned-memory">Pinned Memory&lt;/h5>
&lt;p>Pin-memory threads: GPU-based PyTorch uses such threads to copy data from SHM to pinned memory, but CPU-based PyTorch does not. Therefore, I need to emulate the pin-memory threads. Fortunately, the data copy time is predictable: I have already found that pin-memory time has little to do with the number of workers or the model type, but only with the batch size. I still need to find out whether it depends on the GPU node type, which I assume it does not at this point.&lt;/p>
&lt;p>While implementing the features, we first emulated the CPU-to-GPU transfer time and GPU computation time for the p100 GPU based on the profiled information. Another CUDA behavior that requires emulation is copying data from shared memory to pinned memory. To emulate it, we measured and replayed the time for copying such data. However, the emulator did not behave exactly like the real GPU, because we had only emulated the time cost of using pinned_memory, not its memory cost. To resolve this, we wrote a CPython module that manually allocates page-locked memory (which behaves the same as CUDA&amp;rsquo;s pinned_memory). Once this mechanism was implemented, the emulator&amp;rsquo;s fundamental functions were in place and properly mimicked CUDA&amp;rsquo;s behavior.&lt;/p>
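&lt;p>The idea can be sketched as follows. This stand-in uses a plain anonymous mmap buffer instead of the actual CPython module (which locks the pages), and the copy-time rate is an assumed placeholder, not a measured value:&lt;/p>

```python
import mmap
import time

def emulate_pin_memory(nbytes, copy_time_per_mib=0.0004):
    """Emulate CUDA pinned-memory staging: pay the memory cost by reserving
    a buffer of the same size, and the time cost by sleeping for the
    profiled SHM-to-pinned copy time (placeholder rate here)."""
    buf = mmap.mmap(-1, nbytes)  # anonymous buffer standing in for page-locked memory
    time.sleep(copy_time_per_mib * nbytes / 2**20)
    return buf

buf = emulate_pin_memory(4 * 2**20)  # a 4 MiB batch
print(len(buf))
```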
&lt;h5 id="reason-2">Reason:&lt;/h5>
&lt;p>After collecting the GPU profile, I compared it with the actual GPU and noticed some differences in IO time, meaning there was a gap between emulation-based PyTorch and actual GPU-based PyTorch.&lt;/p>
&lt;h5 id="inter-gpus-computation">Inter-GPUs Computation&lt;/h5>
&lt;p>We worked on emulating inter-GPU computation time in order to reproduce Figure 9 in the DNN stall paper. This is one of the influential factors in multi-GPU training, and we decided to first figure out how to implement this feature. As claimed in the paper, the larger the batch size, the less time it takes to update the model; the smaller the batch size, the larger the overheads prove to be. However, our current emulator would report the same computation time, since we had not yet added features to emulate inter-GPU behavior. The first step was to rent a lease with 2 GPUs and observe the effect of inter-GPU communication on computation time. We found a small amount of overhead when running two GPUs instead of one on the p100 node. My job was to find out where and how these overheads arise and to emulate them in order to reproduce Figure 9. We used resnet18, 4 workers, and 10 batches to run a 128 batch size with 1 GPU (Group A) and a 256 batch size with 2 GPUs (Group B). With our current emulator, both experiments would report the same time to finish one batch. However, we saw that Group B&amp;rsquo;s computation time was longer than Group A&amp;rsquo;s, meaning there was some overhead in computation time. I then dug into the PyTorch source code and successfully identified one of the factors contributing to the overhead.&lt;/p>
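&lt;p>Conceptually, the missing piece can be modeled as an extra synchronization term that is paid only when more than one GPU is used; the numbers below are illustrative assumptions, not measured values:&lt;/p>

```python
# Illustrative per-batch model of the Group A / Group B experiment above.
PER_GPU_COMPUTE = 0.045      # assumed profiled time for a 128-sample batch on one p100
INTER_GPU_OVERHEAD = 0.006   # assumed gradient-sync overhead per batch

def batch_time(per_gpu_compute, n_gpus):
    # Each GPU processes its own shard; multi-GPU runs add a sync term.
    sync = INTER_GPU_OVERHEAD if n_gpus > 1 else 0.0
    return per_gpu_compute + sync

group_a = batch_time(PER_GPU_COMPUTE, n_gpus=1)  # 128 batch on 1 GPU
group_b = batch_time(PER_GPU_COMPUTE, n_gpus=2)  # 256 batch split over 2 GPUs
print(group_a, group_b)  # Group B is slower per batch, matching the observation
```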
&lt;h5 id="reason-3">Reason:&lt;/h5>
&lt;p>To better complete the emulator so that it can provide accurate emulation even when using more than one GPU on a machine.&lt;/p>
&lt;h4 id="reproducing-figures">Reproducing Figures&lt;/h4>
&lt;p>After implementing the emulator, we managed to use it to reproduce Figures 3, 4, 5, and 6 in the paper &lt;a href="https://vldb.org/pvldb/vol14/p771-mohan.pdf">“Analyzing and Mitigating Data Stalls in DNN Training”&lt;/a> after a series of experiments and testing. Some environments in the paper were not identical to what we ran in the past week, but the general patterns matched the expected hypothesis and measurements. We double-checked all the data and figures produced, found that our prototype meets our expectations, and decided it was time to look for other papers to reproduce to make the emulator more interesting.
The original figures are shown below alongside the reproduced ones; you can see that the patterns reflect our expected results:
Original Figure 3:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure3" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure3_hu4825ad7506f6235ff41682e84b760224_101661_b55e965312579f5be79be0c6d21c853a.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure3_hu4825ad7506f6235ff41682e84b760224_101661_def01bfdab18fc7d262d08c4c2388828.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure3_hu4825ad7506f6235ff41682e84b760224_101661_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure3_hu4825ad7506f6235ff41682e84b760224_101661_b55e965312579f5be79be0c6d21c853a.webp"
width="710"
height="399"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Figure 3:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure3" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure3_hu7ba68c9ce0c1781ae4d515ec33f3be68_90176_8b936fcb3ddfc9bb3592d7628c1f8641.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure3_hu7ba68c9ce0c1781ae4d515ec33f3be68_90176_7e97d9d32a7ef0dc9e6fd3817d59f028.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure3_hu7ba68c9ce0c1781ae4d515ec33f3be68_90176_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure3_hu7ba68c9ce0c1781ae4d515ec33f3be68_90176_8b936fcb3ddfc9bb3592d7628c1f8641.webp"
width="676"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Original Figure 4:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure4" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure4_huf7912859d14522ea127e57f50e29e6e8_97339_981ac3e3445c520fd3934870e4eddab4.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure4_huf7912859d14522ea127e57f50e29e6e8_97339_d783b7ce2dba6c9c3d988fc836f9f0ca.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure4_huf7912859d14522ea127e57f50e29e6e8_97339_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure4_huf7912859d14522ea127e57f50e29e6e8_97339_981ac3e3445c520fd3934870e4eddab4.webp"
width="687"
height="390"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Figure 4:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure4" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4_hubf1eec72de057c79eda887bdba386155_82657_05469d8674723f8da1bb12f8f7e3e989.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4_hubf1eec72de057c79eda887bdba386155_82657_60c661b4eb6240ac2765c6540bcaf26c.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4_hubf1eec72de057c79eda887bdba386155_82657_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4_hubf1eec72de057c79eda887bdba386155_82657_05469d8674723f8da1bb12f8f7e3e989.webp"
width="695"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure4a" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4a_huce40b0ef347fbe8c5dda7cd64b89a82a_83431_6b8b84d9f461863c0223d3d2992f1557.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4a_huce40b0ef347fbe8c5dda7cd64b89a82a_83431_732e04a2227ba54fb9fb64e0dbe90717.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4a_huce40b0ef347fbe8c5dda7cd64b89a82a_83431_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4a_huce40b0ef347fbe8c5dda7cd64b89a82a_83431_6b8b84d9f461863c0223d3d2992f1557.webp"
width="687"
height="701"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Original Figure 5:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure5" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure5_hua788af616c21c63d77c26ece571c44f2_48006_c74d4ac38263a1767beef877888c7a0e.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure5_hua788af616c21c63d77c26ece571c44f2_48006_01b1945b0dc6c47dee7de7ade64b50bc.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure5_hua788af616c21c63d77c26ece571c44f2_48006_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure5_hua788af616c21c63d77c26ece571c44f2_48006_c74d4ac38263a1767beef877888c7a0e.webp"
width="549"
height="299"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Figure 5:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure5" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure5_hu19c2897350abb8738bf9073d8758691e_57335_fdba37e325a306223f6a391e9a69a4ac.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure5_hu19c2897350abb8738bf9073d8758691e_57335_8c9f78e75740fb2681a32c842527250b.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure5_hu19c2897350abb8738bf9073d8758691e_57335_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure5_hu19c2897350abb8738bf9073d8758691e_57335_fdba37e325a306223f6a391e9a69a4ac.webp"
width="719"
height="750"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Original Figure 6:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure6" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure6_hu28381534680632c929008cb2ca5db00c_55137_a24c7a3b53c18cb5a6f22a52536fd86f.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure6_hu28381534680632c929008cb2ca5db00c_55137_5f6250df82bf4852c73c925bc3934b14.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure6_hu28381534680632c929008cb2ca5db00c_55137_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure6_hu28381534680632c929008cb2ca5db00c_55137_a24c7a3b53c18cb5a6f22a52536fd86f.webp"
width="527"
height="304"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Figure 6:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure6" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure6_hu932abfdff2db3be7f25d6ea6fa38efc4_57111_1c46178f0a8f8ccef8efa61f5fe40809.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure6_hu932abfdff2db3be7f25d6ea6fa38efc4_57111_97e1110894b3975f840ad24bcbc0df12.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure6_hu932abfdff2db3be7f25d6ea6fa38efc4_57111_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure6_hu932abfdff2db3be7f25d6ea6fa38efc4_57111_1c46178f0a8f8ccef8efa61f5fe40809.webp"
width="760"
height="626"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h5 id="reason-4">Reason:&lt;/h5>
&lt;p>Our original goal was to reproduce papers, so reproducing figures is a solid step toward achieving that.&lt;/p>
&lt;h2 id="summary--coming-future">Summary + Coming Future&lt;/h2>
&lt;p>We will keep working to complete the emulator and figure out the exact mechanisms needed for the implementation. We will also look for additional features worth adding to the emulator.&lt;/p>
&lt;p>Hello! I&amp;rsquo;m Nick, a GSoC contributor for the Proactive Data Containers (PDC) Project.
Over the past few weeks I&amp;rsquo;ve worked on verifying the functionality of the Python API for the PDC project and ensuring the smooth onboarding for new users of the data containers.&lt;/p>
&lt;p>I began by documenting the installation of the Ubuntu virtual machine needed to run the PDC repository, since the project was not initially supported on Apple silicon hardware. The installation notes I recorded for PDC contribute toward a more refined and precise process, which can be seen updated on the GitHub page.&lt;/p>
&lt;p>After installing the project&amp;rsquo;s dependencies onto the VM, I began maintaining the existing Python API and making changes that allow the tests to compile and run successfully. The manual setup had a few problems with file directory paths that prevented some files from being installed on new devices, which I fixed manually by linking the path and removing a few header files. However, this proved to be only a temporary fix, as the prior issue was evidence of a hardcoded path, which was resolved by some digging and alteration in the source code.&lt;/p>
&lt;p>Now the PDC and PDCpy installations should go smoothly regardless of what OS is being used, and the instruction documentation can be found from the github page which should allow any user to access the data containers.&lt;/p></description></item><item><title>Building extensions between Python libraries for Biotechnology laboratories</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/labop/20230804-luhesketh/</link><pubDate>Fri, 28 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/labop/20230804-luhesketh/</guid><description>&lt;p>Hello again! This is Luiza, a GSoC contributor for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/labop">LabOp&lt;/a> Project.
My task is to build bridges between programming languages for Biotechnology Laboratory automation.&lt;/p>
&lt;p>When talking about life sciences, reproducibility is an issue for most research centers. Biotechnology-focused laboratories usually have their own protocols, developed in house for their own applications. Researchers rely on such protocols to perform their experiments and collect data, but when it comes to sharing those protocols and performing them in different laboratories, many difficulties arise. Whether it is from a lack of equipment or reagents, or even from different orders of execution, replicating a protocol in another laboratory is a challenge. To address this issue, LabOp was developed to represent a protocol and convert it into as many forms as possible, so that it can be executed by humans and by machines.&lt;/p>
&lt;p>PylabRobot and PyHamilton also come into the picture, as these libraries make it possible to write protocols for Hamilton robots (and, in PylabRobot’s case, Tecan machines as well). However, both share the limitation of representing laboratory protocols only at a low level, with the user having to write every single command in Python for the protocol to be executed. Thus I am currently developing an extension for converting LabOp protocols into PylabRobot/PyHamilton scripts. This way the researcher writing the protocol can do so in a friendlier fashion, using human-friendly terms to write protocols for robot execution.&lt;/p>
&lt;figure id="figure-behaviourspecialization-for-liquid-handling-class">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="BehaviourSpecialization for Liquid Handling class" srcset="
/report/osre23/ucsd/labop/20230804-luhesketh/featured_hu4f9fc1ff392d0f6236dd97921cc62ee1_67178_7dea1005b9355831aab4fd48906afaec.webp 400w,
/report/osre23/ucsd/labop/20230804-luhesketh/featured_hu4f9fc1ff392d0f6236dd97921cc62ee1_67178_67bd573e81d4a87cd9d10cf5cb216d81.webp 760w,
/report/osre23/ucsd/labop/20230804-luhesketh/featured_hu4f9fc1ff392d0f6236dd97921cc62ee1_67178_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/labop/20230804-luhesketh/featured_hu4f9fc1ff392d0f6236dd97921cc62ee1_67178_7dea1005b9355831aab4fd48906afaec.webp"
width="760"
height="436"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption data-pre="Figure&amp;nbsp;" data-post=":&amp;nbsp;" class="numbered">
BehaviourSpecialization for Liquid Handling class
&lt;/figcaption>&lt;/figure>
&lt;p>The first step is building a correspondence spreadsheet with a hello-world protocol written in both languages (LabOp | PylabRobot). This way we can establish an equivalence between the functions, parameters, and default commands of both libraries, as well as their structure. This spreadsheet will serve as guidance for converting the liquid-handling steps from their representation in LabOp to their representation in PylabRobot.&lt;/p>
&lt;p>The second step is to create a file that will execute the conversion. In this file I will define a labware map, essentially a dictionary translating LabOp resource names into labware IDs recognizable by PylabRobot&amp;rsquo;s &amp;ldquo;resource&amp;rdquo; classes, and a BehaviourSpecialization class that converts LabOp actions into operations of PylabRobot&amp;rsquo;s Liquid Handler class, which coordinates the commands sent from the script to the machines (see featured images).&lt;/p>
&lt;figure id="figure-dictionary-for-labop-to-pylabrobot-container-correspondence">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Dictionary for LabOp to Pylabrobot container correspondence" srcset="
/report/osre23/ucsd/labop/20230804-luhesketh/featured_2_hu5b9f0fb3cb1cb61c40db218a2048e04a_278481_76e3dd3c112ca74ef8e3b7459123e154.webp 400w,
/report/osre23/ucsd/labop/20230804-luhesketh/featured_2_hu5b9f0fb3cb1cb61c40db218a2048e04a_278481_8337c1f75572828ec38252d4fdee0f96.webp 760w,
/report/osre23/ucsd/labop/20230804-luhesketh/featured_2_hu5b9f0fb3cb1cb61c40db218a2048e04a_278481_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/labop/20230804-luhesketh/featured_2_hu5b9f0fb3cb1cb61c40db218a2048e04a_278481_76e3dd3c112ca74ef8e3b7459123e154.webp"
width="760"
height="465"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption data-pre="Figure&amp;nbsp;" data-post=":&amp;nbsp;" class="numbered">
Dictionary for LabOp to Pylabrobot container correspondence
&lt;/figcaption>&lt;/figure>
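&lt;p>A minimal sketch of these two pieces; every name below (labware IDs, method names) is an illustrative placeholder, not the actual LabOp or PylabRobot API:&lt;/p>

```python
# Hypothetical labware map: LabOp resource names to PylabRobot-style labware IDs.
LABWARE_MAP = {
    "96 well plate": "Cos_96_DW_1mL",   # placeholder ID
    "trough": "Trough_CAR_4R200",       # placeholder ID
}

class LiquidHandlingSpecialization:
    """Converts LabOp-style actions into liquid-handler operations."""

    def __init__(self):
        self.commands = []

    def transfer(self, source, destination, volume_ul):
        src = LABWARE_MAP[source]
        dst = LABWARE_MAP[destination]
        # The real extension would call PylabRobot's Liquid Handler here;
        # this sketch just records the converted commands.
        self.commands.append(("aspirate", src, volume_ul))
        self.commands.append(("dispense", dst, volume_ul))

spec = LiquidHandlingSpecialization()
spec.transfer("trough", "96 well plate", 100)
print(spec.commands)
```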
&lt;p>Then we move to the protocol that will be tested on the Hamilton machines: a plasmid purification protocol that is usually performed by a human at a very low level, one sample at a time. This limitation is not present on Hamilton robots, as they can handle many samples at the same time within a single protocol execution. The robot that will run this protocol has two modules that are not yet present in PylabRobot’s extensions: a pressure pump module and an on-deck heater-shaker. I will be implementing these modules in PylabRobot based on their default commands in PyHamilton and running the protocol on a Hamilton Starlet unit.&lt;/p>
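&lt;p>One way such a module could be structured is as a thin wrapper that turns method calls into the raw commands a backend already exposes; the class name and command strings below are invented placeholders, not real PyHamilton or PylabRobot identifiers:&lt;/p>

```python
# Hypothetical sketch of a new hardware-module wrapper.
class PressurePumpModule:
    def __init__(self, send_command):
        # send_command: callable that ships a raw command string to the robot
        self._send = send_command

    def apply_pressure(self, seconds):
        # Build and send an invented placeholder command string.
        return self._send(f"PUMP:ON:{seconds}")

sent = []
pump = PressurePumpModule(lambda cmd: sent.append(cmd) or cmd)
pump.apply_pressure(30)
print(sent)
```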
&lt;p>The steps of the protocol have been decoupled to facilitate the pilot testing, they are as follows:&lt;/p>
&lt;ul>
&lt;li>Liquid handling - GOOD TO GO&lt;/li>
&lt;li>Pressure pump module - requires adjustments&lt;/li>
&lt;li>Plate grippers (necessary to move the plasmid plate from one module to another) - require adjustments&lt;/li>
&lt;li>On-deck heater-shaker - GOOD TO GO&lt;/li>
&lt;/ul>
&lt;p>The first pilot tests of the protocol will be run with water instead of plasmid to verify that all the steps go smoothly. Once that is out of the way, we will perform the protocol with dirty plasmids that require purification (which is what the protocol is for). The measures of success will be sequencing the plasmid (if possible), performing gel electrophoresis, and measuring the absorbance of the DNA.&lt;/p>
&lt;p>The goal of these tests is to gather data on the effectiveness of the protocol and its execution on the machine, thus confirming that it is in fact a useful mechanism for DNA purification.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Fixed GitHub CI Workflows and Release to PyPI:&lt;/strong>
During the first phase, I focused on refining the GitHub CI workflows by implementing new flows that facilitate seamless releases to PyPI. This ensures that the project can be easily distributed and installed by users, making it more accessible and user-friendly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Encapsulation from Jupyter into Module:&lt;/strong>
I successfully encapsulated the code from Jupyter notebooks into a module. This step is crucial as it prepares the codebase to be released as a standalone module, making it easier for developers to use and integrate into their own projects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SonarCloud Integration for Better Code Analysis:&lt;/strong>
To ensure the codebase&amp;rsquo;s quality, I set up SonarCloud to perform comprehensive code analysis. This helps in identifying potential issues, bugs, and areas of improvement, leading to a more robust and reliable project.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Migration to Docker from Tox:&lt;/strong>
In order to improve the containerization process, I replaced the existing solution, Tox, with Docker. Docker provides better container management and ensures a consistent development and deployment environment across different platforms.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Research on Community Platforms for Self-Hosting:&lt;/strong>
I conducted extensive research on various community platforms suitable for self-hosting. This will enable the project to establish a thriving community and foster active collaboration among users and contributors.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Enhanced Security Measures:&lt;/strong>
I implemented several security improvements to safeguard the project and its users. These include setting up a comprehensive security policy, implementing secret scanning to prevent unintentional exposure of sensitive information, code scanning to identify potential vulnerabilities, private vulnerability reporting to handle security issues responsibly, and Dependabot integration for monitoring and managing dependencies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Upgraded Taichi to Utilize Class-Based Features:&lt;/strong>
As part of the project&amp;rsquo;s development, I upgraded the codebase to utilize Taichi&amp;rsquo;s available class-based features, thereby enhancing its organization and maintainability.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Moving forward, I plan to continue working diligently to achieve the goals outlined in my proposal. The improvements made during the first half of the GSoC program have laid a strong foundation for the project&amp;rsquo;s growth and success.&lt;/p>
&lt;p>Stay tuned for further updates and exciting developments as the project progresses!&lt;/p></description></item><item><title>Uncovering Actionable Insights using ReadTheDocs Analytics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/</link><pubDate>Thu, 27 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello again! This is Jack, a GSoC contributor for the OpenROAD Project.
My task is to update and optimise the documentation to encourage user
adoption and engagement.&lt;/p>
&lt;p>For open-source repo maintainers, &lt;a href="https://readthedocs.org/" target="_blank" rel="noopener">readthedocs&lt;/a>
is a godsend. One of its more underrated features is providing
search and traffic analytics for up to &lt;strong>90 days&lt;/strong> for &lt;code>Community&lt;/code> tier
users. This is awesome, because ReadTheDocs is &amp;ldquo;always free for open source
and community projects&amp;rdquo;.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Why are analytics important?&lt;/p>
&lt;p>Analytics are a useful &lt;em>proxy&lt;/em> for documentation engagement.
For instance, heavy traffic to a page could mean the tool is popular,
or it could mean the tool is unclear and readers need repeated visits
to understand its usage. Either way, increased visits signal that the
page deserves attention.&lt;/p>
&lt;p>In what follows we aim to provide a quick tutorial as well as
list out some of the actionable insights we uncovered in the
OpenROAD/OpenROAD-flow-scripts documentation project.&lt;/p>
&lt;h2 id="preamble">Preamble&lt;/h2>
&lt;p>To download the analytics raw &lt;code>csv&lt;/code> files, refer to this
&lt;a href="https://docs.readthedocs.io/en/stable/analytics.html" target="_blank" rel="noopener">website&lt;/a>.&lt;/p>
&lt;p>You should also have the following packages installed: &lt;code>pandas&lt;/code>, &lt;code>numpy&lt;/code>, &lt;code>matplotlib&lt;/code>, &lt;code>scipy&lt;/code>.&lt;/p>
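The snippets in this post use the conventional aliases pd, np, plt, and stats without repeating the imports; a minimal setup sketch is:

```python
# Conventional aliases assumed by the analysis snippets below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
```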
&lt;h2 id="traffic-analytics">Traffic Analytics&lt;/h2>
&lt;p>Traffic analytics are easy to understand.
The data comes in the format &lt;code>Date&lt;/code>, &lt;code>Version&lt;/code>, &lt;code>Path&lt;/code>, &lt;code>DailyViews&lt;/code>, as follows:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">read_csv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;ta_or.csv&amp;#39;&lt;/span>&lt;span class="p">)[::&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">reset_index&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">apply&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">split&lt;/span>&lt;span class="p">()[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">head&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-1-loading-traffic-analytics-dataframe">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Load traffic analytics DF" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic1_hu3ae39bf6bc653845cdf52f284f9914c8_18120_0fe44b789026339d8a488b67e455af49.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic1_hu3ae39bf6bc653845cdf52f284f9914c8_18120_c34649440686784f502a8fa245519fe8.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic1_hu3ae39bf6bc653845cdf52f284f9914c8_18120_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic1_hu3ae39bf6bc653845cdf52f284f9914c8_18120_0fe44b789026339d8a488b67e455af49.webp"
width="420"
height="345"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 1: Loading traffic analytics dataframe
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>The raw data is not all that informative.
Let us aggregate the data to obtain the weekly views.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">weeklydf&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">copy&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">weeklydf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to_datetime&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">weeklydf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to_timedelta&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">7&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">unit&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;d&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">weeklydf&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">weeklydf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">groupby&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="s1">&amp;#39;Path&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Grouper&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">key&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freq&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;W&amp;#39;&lt;/span>&lt;span class="p">)])[&lt;/span>&lt;span class="s1">&amp;#39;Views&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">()&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">reset_index&lt;/span>&lt;span class="p">()&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">sort_values&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">weeklydf&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">weeklydf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Path&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;/index.html&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-2-aggregated-weekly-traffic">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Aggregated weekly traffic" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic2_hu4e5090e8319a278be3c23daaec31a810_14831_2356d16291dbea694b0bc9c05693ffe8.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic2_hu4e5090e8319a278be3c23daaec31a810_14831_cf13de62f49742cd0e76c661feea93ed.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic2_hu4e5090e8319a278be3c23daaec31a810_14831_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic2_hu4e5090e8319a278be3c23daaec31a810_14831_2356d16291dbea694b0bc9c05693ffe8.webp"
width="243"
height="393"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 2: Aggregated weekly traffic
&lt;/figcaption>&lt;/figure>
&lt;/p>
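To see how the pd.Grouper weekly bucketing behaves, here is a self-contained sketch on synthetic traffic data (the dates, paths, and view counts are made up for illustration):

```python
import pandas as pd

# Synthetic traffic export: daily views for two paths across two weeks.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2023-07-01', '2023-07-02',
                            '2023-07-08', '2023-07-09'] * 2),
    'Path': ['/index.html'] * 4 + ['/main/README.html'] * 4,
    'Views': [5, 3, 7, 2, 1, 1, 2, 2],
})

# Bucket each path's views into calendar weeks ending on Sunday ('W').
weekly = (df.groupby(['Path', pd.Grouper(key='Date', freq='W')])['Views']
            .sum()
            .reset_index()
            .sort_values('Date'))
print(weekly[weekly.Path == '/index.html'])
```

Summing per path-and-week pair is the same aggregation used to produce Figure 2.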
&lt;p>Note that we can substitute any page path of interest.
To list all page paths present in this dataset, use:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">weeklydf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-3-unique-paths-in-dataset">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Unique paths" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic3_hub46945e77dd8a933670e33e4fea7dea8_54129_94dd6b47fa834b3c36ea619deffd3a6a.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic3_hub46945e77dd8a933670e33e4fea7dea8_54129_f50b03560ab266073e2dee2fa7a04e51.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic3_hub46945e77dd8a933670e33e4fea7dea8_54129_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic3_hub46945e77dd8a933670e33e4fea7dea8_54129_94dd6b47fa834b3c36ea619deffd3a6a.webp"
width="591"
height="538"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 3: Unique paths in dataset
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>With this data in hand, let us do some plotting!
For the visualisation, we use traffic aggregated on a daily scale.
On top of this, we also plot a linear best-fit line through all the
points to track the trend over time.&lt;/p>
&lt;p>The code below shows how to plot the top 20 pages.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">plot_views&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">df&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">numPages&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">20&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Groupby Path, sum views&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">pathResults&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">groupby&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Path&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Views&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sort_values&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ascending&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fig&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ax&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">numPages&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">figsize&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">15&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">30&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fig&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">tight_layout&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">numPages&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">key&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pathResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">index&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">temp&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Path&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">key&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">scatter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">temp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">temp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Views&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_xticks&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">arange&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">90&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">7&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="c1"># this line is to not clutter the x-axis too much.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Views&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">key&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># linear regression&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">temp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">temp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Views&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">bestfit&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">stats&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">linregress&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">equation&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s2">&amp;#34;x + &amp;#34;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">poly1d&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">polyfit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">))(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">))),&lt;/span> &lt;span class="s1">&amp;#39;--&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">equation&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">legend&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">loc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;upper right&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-4-top-20-pages-by-daily-view-counts-in-descending-order">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 20 plots" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic4_hu0544053b26ff363bea669ad03cb25a33_298754_208fbbf3fe9f3d6b7b48a8f44d65e70b.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic4_hu0544053b26ff363bea669ad03cb25a33_298754_523ed86a22800eb3addad7738facd6cc.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic4_hu0544053b26ff363bea669ad03cb25a33_298754_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic4_hu0544053b26ff363bea669ad03cb25a33_298754_208fbbf3fe9f3d6b7b48a8f44d65e70b.webp"
width="379"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 4: Top 20 pages by daily view counts (in descending order)
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Also, we can aggregate the total views by day to plot daily traffic:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">plot_daily_traffic&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">df&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Groupby Date, sum views&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fig&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">figure&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">figsize&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">15&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dateResults&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">groupby&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Views&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">dateResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">index&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dateResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">values&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">scatter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xticks&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">arange&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">90&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">7&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Views&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Traffic by Day&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># linear regression&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">bestfit&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">stats&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">linregress&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">equation&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s2">&amp;#34;x + &amp;#34;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">poly1d&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">polyfit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">))(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">))),&lt;/span> &lt;span class="s1">&amp;#39;--&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">equation&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">legend&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">loc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;upper right&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-5-daily-aggregated-traffic">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Daily aggregated traffic" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic5_hu284f436d507b391ad27b39b31846aa7d_24195_f1cfe4f85a6f52b10851153e3759601f.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic5_hu284f436d507b391ad27b39b31846aa7d_24195_be83d71fe2635b895829f733ef678a4f.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic5_hu284f436d507b391ad27b39b31846aa7d_24195_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic5_hu284f436d507b391ad27b39b31846aa7d_24195_f1cfe4f85a6f52b10851153e3759601f.webp"
width="760"
height="503"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 5: Daily aggregated traffic
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h3 id="key-trends">Key Trends:&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Notice the weekly cyclical pattern: average view counts rise from
Monday to Friday, then fall off on weekends. This is most evident for
the pages &lt;code>/index.html&lt;/code> and &lt;code>/main/README.html&lt;/code>,
and likely reflects the standard Monday-to-Friday work or study week.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>According to the gradient of the best-fit line in Figure 2,
traffic to the OpenROAD docs is in slow decline. A gradient of -0.77
translates to a decline of roughly 22 views per month. The apparent
decline is partly attributable to the higher traffic from 19-29
March 2023, the dates of the
&lt;a href="https://openroaddesigncontest.org/" target="_blank" rel="noopener">OpenROAD 7nm design contest&lt;/a>.
Contests are always good for driving traffic.&lt;/p>
&lt;/li>
&lt;/ul>
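To make the arithmetic behind that estimate explicit (a sketch: the slope is read off the Figure 2 best-fit line, and a four-week month is assumed):

```python
# Best-fit gradient from Figure 2, in views per day.
daily_slope = -0.77

# Approximate the monthly change assuming a four-week (28-day) month.
monthly_change = daily_slope * 28
print(round(monthly_change))  # about -22 views per month
```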
&lt;h3 id="actionable-insights">Actionable insights:&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Top pages are usually landing pages: &lt;code>index.html&lt;/code>, &lt;code>main/README.html&lt;/code>, &lt;code>main/src/README.html&lt;/code>. We thus prioritised making these pages more readable and concise.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>This is followed by tutorial &lt;code>/tutorials/index.html&lt;/code> and &lt;code>/search.html&lt;/code>. The prominence of the tutorials page made us shift the tutorials link to a higher position on the left navigation sidebar. Search tips were also included to obtain better search results. More about search in the next section.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Next, since OpenROAD consists of 20 tools, traffic analytics helps us decide the order in which to update their documentation: &lt;code>ifp&lt;/code>, &lt;code>gui&lt;/code>, &lt;code>odb&lt;/code>, &lt;code>ppl&lt;/code>, &lt;code>sta&lt;/code>, &lt;code>grt&lt;/code>, &lt;code>mpl&lt;/code>, &lt;code>gpl&lt;/code>, &lt;code>rsz&lt;/code>, &lt;code>rcx&lt;/code>, &lt;code>pdn&lt;/code>, &lt;code>cts&lt;/code>, &lt;code>psm&lt;/code>.&lt;/p>
&lt;/li>
&lt;/ul>
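&lt;p>The per-tool update order can be derived from the same traffic data with a short aggregation. This is a sketch on made-up rows; the &lt;code>Path&lt;/code>/&lt;code>Views&lt;/code> columns and the &lt;code>/main/src/&amp;lt;tool&amp;gt;/README.html&lt;/code> path layout are assumptions.&lt;/p>

```python
import pandas as pd

# Hypothetical traffic rows for per-tool README pages.
df = pd.DataFrame({
    "Path": ["/main/src/ifp/README.html", "/main/src/gui/README.html",
             "/main/src/odb/README.html", "/main/src/ifp/README.html"],
    "Views": [10, 7, 5, 12],
})

# Extract the tool name from the path and rank tools by total views.
df["Tool"] = df.Path.str.split("/").str[3]
order = df.groupby("Tool").Views.sum().sort_values(ascending=False)
print(order.index.tolist())  # → ['ifp', 'gui', 'odb']
```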
&lt;h2 id="search-analytics">Search Analytics&lt;/h2>
&lt;p>Search analytics come in the form of: &lt;code>Date&lt;/code>, &lt;code>Query&lt;/code>, &lt;code>TotalResults&lt;/code>.
Unlike traffic analytics, &lt;code>TotalResults&lt;/code> does not refer to the search count
for the query that day; rather, it is the total number of results
returned by that query on that day. Separate aggregation is still needed
to obtain the final count.&lt;/p>
&lt;p>Firstly, let us load the dataset and perform a groupby on the column &lt;code>Date&lt;/code>
to obtain the daily count aggregates.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">read_csv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;sa_or.csv&amp;#39;&lt;/span>&lt;span class="p">)[::&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">reset_index&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rename&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">columns&lt;/span> &lt;span class="o">=&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="s1">&amp;#39;Created Date&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;Total Results&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s1">&amp;#39;TotalResults&amp;#39;&lt;/span>&lt;span class="p">})&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">apply&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">split&lt;/span>&lt;span class="p">()[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">dateResults&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">groupby&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">TotalResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">dateResults&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-6-code-output-for-daily-aggregated-search-counts">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Daily count code" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic6_hufe284b2003be927a09036a17b0f147ed_7438_303764681c719b59422e8ac4adff87d5.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic6_hufe284b2003be927a09036a17b0f147ed_7438_ae0b89dd9a05f1d083e0a5caf434a1c6.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic6_hufe284b2003be927a09036a17b0f147ed_7438_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic6_hufe284b2003be927a09036a17b0f147ed_7438_303764681c719b59422e8ac4adff87d5.webp"
width="390"
height="231"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 6: Code output for daily aggregated search counts.
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Now we are ready to plot the daily aggregated searches. This represents
the number of times a search was performed on the documentation website.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">plot_daily_searches&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">df&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dateResults&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">groupby&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">TotalResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">dateResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">index&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dateResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">values&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">scatter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xticks&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">arange&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">90&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">7&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;# Times Searched&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Search count by day&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># linear regression&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">bestfit&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">stats&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">linregress&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">equation&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s2">&amp;#34;x + &amp;#34;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">poly1d&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">polyfit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">))(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">))),&lt;/span> &lt;span class="s1">&amp;#39;--&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">equation&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">legend&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">loc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;upper right&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-7-daily-aggregated-search-counts">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Final search analytics graph" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic7_huaf6e8114aa38a4f77afbcf6239df4596_24960_dfcee10fa9be516c148eb11ac3598591.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic7_huaf6e8114aa38a4f77afbcf6239df4596_24960_2bfda1034e5a343c34c529e62f8279ba.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic7_huaf6e8114aa38a4f77afbcf6239df4596_24960_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic7_huaf6e8114aa38a4f77afbcf6239df4596_24960_dfcee10fa9be516c148eb11ac3598591.webp"
width="760"
height="507"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 7: Daily aggregated search counts
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>We can also make an additional plot of queries that return zero results.
In other words, these are the terms people are curious about
but that our documentation does not currently cover.
Think of it as on-site search engine optimisation.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">zeroResults&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">TotalResults&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">zeroResults&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">zeroResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">groupby&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Query&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sort_values&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ascending&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s1">All 0 results queries (desc)&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">zeroResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">index&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">tolist&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Example output as follows:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[&amp;#39;autotuner&amp;#39;, &amp;#39;tdms&amp;#39;, &amp;#39;*macro*&amp;#39;, &amp;#39;rtlmp_max_inst&amp;#39;, &amp;#39;get_property&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;check_setup&amp;#39;, &amp;#39;centos&amp;#39;, &amp;#39;initialize_padring&amp;#39;, &amp;#39;core_utilization&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;pin_access&amp;#39;, &amp;#39;read_libraries&amp;#39;, &amp;#39;config&amp;#39;, &amp;#39;eco&amp;#39;, &amp;#39;rpt&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;improve_placement&amp;#39;, &amp;#39;define_process_corner&amp;#39;, &amp;#39;global_place&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;report_worst_slack&amp;#39;, &amp;#39;max_phi_cof&amp;#39;, &amp;#39;report_power&amp;#39;, &amp;#39;get_pins&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;registerfile&amp;#39;, &amp;#39;set_global_routing&amp;#39;, &amp;#39;prebuilt&amp;#39;, &amp;#39;env&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;repair_clock_inverters&amp;#39;, &amp;#39;set_thread_count&amp;#39;, &amp;#39;report_&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;partition_design&amp;#39;, &amp;#39;place_cell&amp;#39;, &amp;#39;blockage&amp;#39;, &amp;#39;partitionmgr&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;nmos&amp;#39;, &amp;#39;tuner&amp;#39;, &amp;#39;write_sdf&amp;#39;, &amp;#39;place_density&amp;#39;, &amp;#39;place_pins_args&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;size_cell&amp;#39;, &amp;#39;*macor*&amp;#39;, &amp;#39;repair_clock_inverter&amp;#39;, &amp;#39;misk&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;readhaty&amp;#39;, &amp;#39;readhat&amp;#39;, &amp;#39;obstruct&amp;#39;, &amp;#39;odbpy&amp;#39;, &amp;#39;openpdn&amp;#39;, &amp;#39;openram&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;placement_cfg&amp;#39;, &amp;#39;read_macro_placement&amp;#39;, &amp;#39;output_drc&amp;#39;, &amp;#39;positon&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;pct&amp;#39;, &amp;#39;qrctechtable&amp;#39;, &amp;#39;qrctechfile&amp;#39;, &amp;#39;qrctech&amp;#39;, &amp;#39;qrc&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;properly covered&amp;#39;, &amp;#39;precision innovations&amp;#39;, &amp;#39;repeater&amp;#39;, &amp;#39;&amp;#34;rcx-0487&amp;#34;&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;report_worst&amp;#39;, &amp;#39;report_area&amp;#39;, &amp;#39;report_clock_properties&amp;#39;, &amp;#39;skywater&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;study&amp;#39;, &amp;#39;sv&amp;#39;, &amp;#39;synth&amp;#39;, &amp;#39;synth_hierarchical&amp;#39;, &amp;#39;systemverilog&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;tdm&amp;#39;, &amp;#39;tdms_place&amp;#39;, &amp;#39;triton&amp;#39;, &amp;#39;ungroup&amp;#39;, &amp;#39;verilog_files&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;wrc&amp;#39;, &amp;#39;write_lef&amp;#39;, &amp;#39;write_partition_verilog&amp;#39;, &amp;#39;שואם&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;si2&amp;#39;, &amp;#39;sever&amp;#39;, &amp;#39;setrc&amp;#39;, &amp;#39;rtl_macro&amp;#39;, &amp;#39;report_dcalc&amp;#39;, &amp;#39;report_design&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;report_design_info&amp;#39;, &amp;#39;report_instance&amp;#39;, &amp;#39;report_slews&amp;#39;, &amp;#39;resize&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;rtlmp&amp;#39;, &amp;#39;set_power_activity&amp;#39;, &amp;#39;rtree&amp;#39;, &amp;#39;run_all&amp;#39;, &amp;#39;run_all.tcl&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;sc&amp;#39;, &amp;#39;set_all_input_output_delays&amp;#39;, &amp;#39;set_io_pin_constraints&amp;#39;, &amp;#39;metis&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;lefdef&amp;#39;, &amp;#39;make_result_file&amp;#39;, &amp;#39;macro_placement_cfg&amp;#39;, &amp;#39;clock__details&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;clocks__details&amp;#39;, &amp;#39;combinational&amp;#39;, &amp;#39;config.mk&amp;#39;, &amp;#39;coord&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;core_margin&amp;#39;, &amp;#39;db_process_node&amp;#39;, &amp;#39;dbblocjs&amp;#39;, &amp;#39;dbdatabase&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;dbr&amp;#39;, &amp;#39;dbrt&amp;#39;, &amp;#39;dbrttree&amp;#39;, &amp;#39;debian&amp;#39;, &amp;#39;define_pin_shape&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;densiy&amp;#39;, &amp;#39;desgin&amp;#39;, &amp;#39;diff_file&amp;#39;, &amp;#39;clk_period&amp;#39;, &amp;#39;clk_io_ptc&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;cdl&amp;#39;, &amp;#39;analog&amp;#39;, &amp;#39;./env.sh&amp;#39;, &amp;#39;178&amp;#39;, &amp;#39;6_final&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;6_final.odb&amp;#39;, &amp;#39;_placement&amp;#39;, &amp;#39;abat&amp;#39;, &amp;#39;add_stripe&amp;#39;, &amp;#39;arch&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;ccs&amp;#39;, &amp;#39;binaries&amp;#39;, &amp;#39;bookshelf&amp;#39;, &amp;#39;buff_cell&amp;#39;, &amp;#39;buildwithdocker&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;busbitchars&amp;#39;, &amp;#39;buschar&amp;#39;, &amp;#39;captable&amp;#39;, &amp;#39;directoryobject&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;disallow_one_site_gaps&amp;#39;, &amp;#39;distribute&amp;#39;, &amp;#39;is_port&amp;#39;, &amp;#39;hierarch&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;hop&amp;#39;, &amp;#39;hyper&amp;#39;, &amp;#39;initialie_flooorplan&amp;#39;, &amp;#39;initialize_flooorplan&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;instance_count&amp;#39;, &amp;#39;is_chip&amp;#39;, &amp;#39;lean&amp;#39;, &amp;#39;gui_final&amp;#39;, &amp;#39;lec&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;*def*&amp;#39;, &amp;#39;limitation&amp;#39;, &amp;#39;lyp&amp;#39;, &amp;#39;maco&amp;#39;, &amp;#39;macro_pin&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;macro_place&amp;#39;, &amp;#39;harness&amp;#39;, &amp;#39;gui.py&amp;#39;, &amp;#39;dont&amp;#39;, &amp;#39;fill_cell&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;dreamplace&amp;#39;, &amp;#39;em&amp;#39;, &amp;#39;enable_dpo&amp;#39;, &amp;#39;energy&amp;#39;, &amp;#39;env.sh&amp;#39;, &amp;#39;erc&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;export&amp;#39;, &amp;#39;findmaste&amp;#39;, &amp;#39;grt_layer_adjustments&amp;#39;, &amp;#39;findmaster&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;freepdk45&amp;#39;, &amp;#39;gdt&amp;#39;, &amp;#39;global_&amp;#39;, &amp;#39;global_place_db&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;global_placementy&amp;#39;, &amp;#39;graph&amp;#39;, &amp;#39;갲&amp;#39;]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In our case, the problem behind each of these zero-result queries
roughly falls into one of the following categories:&lt;/p>
&lt;ul>
&lt;li>Missing documentation: either the parameter or the functionality is not yet documented.&lt;/li>
&lt;li>Typo: User has the right keyword, but did not type it correctly. We will therefore provide them with search &lt;a href="https://openroad-flow-scripts.readthedocs.io/en/latest/user/FAQS.html#how-do-i-get-better-search-results" target="_blank" rel="noopener">tips&lt;/a> such as using fuzziness &lt;code>~N&lt;/code> operator for better matches.&lt;/li>
&lt;/ul>
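&lt;p>Likely typos in the zero-result list can be surfaced automatically by fuzzy-matching each query against the documented command names. The sketch below uses Python&amp;rsquo;s &lt;code>difflib&lt;/code>; the command list is an illustrative subset, not the full OpenROAD command set.&lt;/p>

```python
import difflib

# Illustrative subset of documented commands.
commands = ["global_placement", "initialize_floorplan", "report_design_area"]

# Zero-result queries from the search log; some are near-misses.
zero_queries = ["global_placementy", "initialize_flooorplan", "autotuner"]

# Map each query to its closest documented command, if similar enough.
suggestions = {
    q: difflib.get_close_matches(q, commands, n=1, cutoff=0.8)
    for q in zero_queries
}
print(suggestions)
```

&lt;p>Queries that match nothing (like &lt;code>autotuner&lt;/code> here) are candidates for genuinely missing documentation.&lt;/p>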
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;p>ReadTheDocs could also be linked with
&lt;a href="https://analytics.google.com/analytics/web/provision/#/provision" target="_blank" rel="noopener">Google Analytics&lt;/a>,
but this is left as an option for more advanced users.&lt;/p>
&lt;p>Another rich source of information for open-source maintainers
is GitHub issues, the platform where users discuss
their problems directly. Another great way to track documentation engagement
is to use metrics such as installation issues per week,
or user-issue retention rate, which tracks the number of users
who continue to file issues after their first.&lt;/p>
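&lt;p>Both metrics can be prototyped in a few lines once issues are exported to a table. This is a sketch on made-up data; real rows would come from the GitHub API, and the column names are assumptions.&lt;/p>

```python
import pandas as pd

# Hypothetical exported issues: creation date and author.
issues = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2023-07-03", "2023-07-05", "2023-07-12", "2023-07-13", "2023-07-20",
    ]),
    "author": ["alice", "bob", "alice", "carol", "alice"],
})

# Issues filed per ISO week.
per_week = issues.groupby(issues.created_at.dt.isocalendar().week).size()
print(per_week)

# User-issue retention: share of authors who file more than one issue.
counts = issues.author.value_counts()
retention = (counts > 1).mean()
print(retention)
```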
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This post showcases the amount of insight one can gather from parsing
traffic and search analytics. It also provides useful Python functions
that can be applied to the analytics dataset for fast prototyping
and experimentation. If you are a contributor to open-source projects,
try uncovering some insights for your doc pages today!&lt;/p></description></item><item><title>Halfway Through GSOC: My Experience and Learnings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edumle/20230718-kokoedwin/</link><pubDate>Mon, 17 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edumle/20230718-kokoedwin/</guid><description>&lt;p>Hello there! I&amp;rsquo;m Jonathan Edwin, all the way from the beautiful archipelago of Indonesia. This year, I got the exciting chance to jump on board the 2023 Summer of Reproducibility initiative. It&amp;rsquo;s been quite the adventure! Right now, I&amp;rsquo;m pouring my energy into a fascinating project titled &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project. I&amp;rsquo;m thrilled to be able to make my own little mark on it.&lt;/p>
&lt;p>For those of you who are not familiar with what I&amp;rsquo;m working on, let me shed some light. My project, as part of the &amp;ldquo;Using Reproducibility in Machine Learning Education&amp;rdquo; initiative under guidance of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>, focuses on creating educational resources that center around reproducing some key machine learning techniques. These include Cutout data augmentation, U-Net, and Siamese networks, to name a few. The end product will be a series of interactive Jupyter notebooks that provide step-by-step guidance for students, helping them not only understand these complex models but also gain hands-on experience in achieving research reproducibility.&lt;/p>
&lt;p>&lt;strong>Progress and Challenges&lt;/strong>&lt;/p>
&lt;p>Embarking on this project, I dove headfirst into the world of Cutout data augmentation, immersing myself in the many experiments outlined in the foundational paper. This initial study proved to be an intricate blend of multiple datasets, two network architectures, and a performance evaluation of models with and without Cutout data augmentation. Additionally, it included the exploration of these models in combination with other data augmentation techniques.&lt;/p>
&lt;p>One of our main objectives has been to help students visualize how the model interacts with the data, and for this, we&amp;rsquo;ve been leveraging a tool called Grad-CAM. The initial paper provided a rich landscape for exploration and learning, leading us to segment our journey into six interactive Jupyter notebooks - Introduction, CutOut, ResNet, WideResNet, Regularization, and Grad-CAM.&lt;/p>
&lt;p>I&amp;rsquo;m excited to share that, as we&amp;rsquo;ve hit the mid-term milestone, I&amp;rsquo;ve managed to make significant strides and completed the notebooks up to the WideResNet section. It&amp;rsquo;s been a journey full of learning and growth, overcoming various challenges along the way - understanding the intricacies of the experiments, deconstructing complex architectures, and distilling all this into digestible, interactive notebooks for students. Despite the challenges, the process has been incredibly rewarding. As we gear up for the next half of the project, I&amp;rsquo;m eager to tackle the remaining sections and share my work with the community.&lt;/p>
&lt;p>&lt;strong>Learnings and Skills Gained&lt;/strong>&lt;/p>
&lt;p>&lt;em>&lt;strong>Embracing the Iterative Process of Open Source Development&lt;/strong>&lt;/em>: My initial foray into open source development had me writing and running code in one environment, then copying parts of it to another environment and pushing it from there to GitHub. This occasionally led to mistakes during the code migration. However, I&amp;rsquo;ve since learned to write or change a little bit of code, run the new version directly from GitHub, catch errors, and improve. In open source development, the end goal is to ensure everything works flawlessly, even if it involves several iterations. This is especially true considering the code from GitHub might directly run on platforms like Chameleon or Google Colab.&lt;/p>
&lt;p>&lt;em>&lt;strong>Understanding the Distinction between Reproducing Experiments and Crafting Educational Content&lt;/strong>&lt;/em>: There&amp;rsquo;s a stark difference between merely reproducing an experiment from a research paper and creating an educational resource around that experiment. The former generally involves cloning and running the code, verifying it against the claims in the paper with minimal modifications. The latter, however, necessitates adapting and simplifying the code, regardless of the learner&amp;rsquo;s skill level, to ensure their comprehension. It&amp;rsquo;s about carefully guiding learners through each step for a more profound understanding.&lt;/p>
&lt;p>&lt;em>&lt;strong>The Power of &amp;lsquo;Show, Don’t Tell&amp;rsquo;&lt;/strong>&lt;/em>: This priceless lesson was imparted by my mentor, Ms. Fraida Fund. Rather than telling me what to do when I erred or needed to learn something new, she demonstrated the correct way first-hand. This hands-on approach made understanding far easier. This principle is also reflected in the creation of our notebooks. For instance, we chose to include the Grad-CAM notebook. Although not directly referenced in the paper, it offers students a clear visual understanding of the impact of the Cutout technique, embodying the &amp;ldquo;show, don’t tell&amp;rdquo; philosophy.&lt;/p>
&lt;p>&lt;strong>Next Steps&lt;/strong>&lt;/p>
&lt;p>As we step into the second half of this thrilling journey, our primary goal is to complete the remaining sections of our Cutout project. We&amp;rsquo;re setting our sights on the final notebook - Grad-CAM. The Grad-CAM notebook will offer a visual exploration of how our models interpret and interact with data, thereby solidifying the students&amp;rsquo; understanding of Cutout data augmentation. So, stay tuned for more as we plunge into these fascinating topics!&lt;/p>
&lt;p>&lt;strong>Conclusion&lt;/strong>&lt;/p>
&lt;p>Looking back, my time with the Summer of Reproducibility initiative has been nothing short of a profound learning experience. Working on the &amp;ldquo;Using Reproducibility in Machine Learning Education&amp;rdquo; project has been both challenging and rewarding, and I am incredibly grateful for this opportunity.&lt;/p>
&lt;p>I&amp;rsquo;ve gained valuable insights into open-source development, delved deeper into the intricacies of machine learning techniques, and experienced firsthand the transformative power of a &amp;lsquo;show, don&amp;rsquo;t tell&amp;rsquo; teaching approach. Moreover, I&amp;rsquo;ve learned that the creation of educational resources requires a delicate balance between preserving the essence of original research and adapting it to foster easy understanding.&lt;/p>
&lt;p>As we press forward, I&amp;rsquo;m excited about the prospects of the coming weeks. The completion of the Grad-CAM notebook lies ahead, marking the final pieces of our Cutout project. Beyond this project, the skills and lessons I&amp;rsquo;ve acquired during this initiative will undoubtedly guide me in future endeavours.&lt;/p>
&lt;p>I can confidently say that my GSOC journey has been a remarkable chapter in my growth as a developer and researcher. Here&amp;rsquo;s to more learning, more coding, and more breakthroughs in the future!&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230712-shayantan/</link><pubDate>Wed, 12 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230712-shayantan/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> my &lt;a href="https://drive.google.com/file/d/1N81dqvdTDcKjz5WDAUCdf5yi1BNR9Au6/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>, Martin Putra and collaborator &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/">Charis Christopher Hulu&lt;/a> (another OSRE fellow) aims to analyze large-scale sequencing datasets in order to gain insights on how ‘input quality’ affects genomic workflows’ execution times.&lt;br>
Recent advancements in Next-Generation Sequencing (NGS) technologies have resulted in massive amounts of nucleotide sequence data and automated genomic workflows to streamline analysis and data interpretation. The success of NGS-driven research has also led to a sudden increase in data of varying size and complexity, making it more time-consuming for researchers to test hypotheses. Analyzing
high-throughput genomic data requires a step-by-step execution of dedicated tools - also known as workflows. The first step toward the execution of a typical genomic analysis workflow is quality control
of the raw data - a crucial step in removing low-quality data instances that may significantly impact the downstream analysis. Prior work in this area has suggested that the runtimes of genomic workflows are affected by qualitative differences in the data. Additionally, there is very little consensus on what constitutes “input quality” regarding data from large genomic experiments. In this proposal, we hypothesize that genomic data quality significantly impacts the genomic workflows’ execution time. We aim to leverage machine learning techniques to extract predictive features from quality control tools that robustly predict workflow execution time.&lt;/p></description></item><item><title>Highlighting and Formatting Pyrope HDL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/livehd/20230526-rbaxt/</link><pubDate>Thu, 22 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/livehd/20230526-rbaxt/</guid><description>&lt;p>As part of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd">Micro Architecture Santa Cruz (MASC)&lt;/a>, my &lt;a href="https://drive.google.com/file/d/1aJIF-geNoN49zjkFS1W7yur2-rYCxhrt/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of Jose Renau aims to develop syntax highlighting and a vertical alignment tool for Pyrope. Pyrope is a modern hardware description language under development by MASC. Code is parsed with the &lt;a href="https://github.com/masc-ucsc/tree-sitter-pyrope/tree/main" target="_blank" rel="noopener">tree-sitter grammar for Pyrope&lt;/a>. I am working on developing a query file for the nvim-treesitter plugin. This gives neovim users Pyrope syntax highlighting based on the parse tree. In addition to syntax highlighting, I am working on a vertical alignment tool to improve code readability. 
These features will improve the usability and convenience of Pyrope.&lt;/p></description></item><item><title>Proactive Data Containers</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/lbl/pdc/20230620-nijwang/</link><pubDate>Tue, 20 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/lbl/pdc/20230620-nijwang/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/pdc">Proactive Data Containers (PDC)&lt;/a> project, my &lt;a href="https://docs.google.com/document/d/1Pnt-iq9pWD70d_jmSsoJjnbXtIjJGY3IbXFrwyFT4Q4/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/houjun-tang/">Houjun Tang&lt;/a> aims to develop a novel data abstraction for managing science data in an object-oriented manner. PDCs will provide efficient strategies for moving data in deep storage hierarchies and techniques for transforming and reorganizing data based on application requirements. The container objects themselves are already well developed, so my goal is to verify the functionality tests for the Python API to ensure that it can be used with ease, and to create command-line tools so that PDC is a complete data object that can be used across platforms and is simple and helpful for users.&lt;/p></description></item><item><title>Public Artifact Data and Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230617-zjyhhhhh/</link><pubDate>Sat, 17 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230617-zjyhhhhh/</guid><description>&lt;p>Hello! 
As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/intel/artifactviz">Public Artifact Data and Visualization&lt;/a> project, our proposals (&lt;a href="https://drive.google.com/file/d/1egIQDLMQ5eV7Uc-S55-GTiSXdmrC3_Pj/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> from &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jiayuan-zhu/">Jiayuan Zhu&lt;/a> and &lt;a href="https://drive.google.com/file/d/1Gf68Pz8v3YjcQ1sWkS9n2hnl7_lsme2l/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> from &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/krishna-madhwani/">Krishna Madhwani&lt;/a>) under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjo-vahldiek-oberwagner/">Anjo Vahldiek-Oberwagner&lt;/a> aim to design a system that allows researchers to conveniently record and compare the environmental information, such as CPU utilization, of different iterations and versions of code during an experiment.&lt;/p>
&lt;p>In academic experiments, there is often a need to compare results and performance between different iterations and versions. This comparative analysis helps researchers evaluate the impact of different experimental parameters and algorithms on the results and enables them to optimize experimental design and algorithm selection. However, to conduct effective comparative analysis, it is essential to record and compare environmental information, alongside the experimental data. This information provides valuable insights into the factors that may influence the observed outcomes.&lt;/p>
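To make the recording-and-comparison idea concrete, here is a minimal stdlib-only sketch; the log file name, the JSON fields, and the `record_run`/`compare_runs` helpers are hypothetical illustrations, not the project's actual interface, and a real system would also capture environment metrics such as CPU utilization:

```python
import json
import time
from pathlib import Path

# Hypothetical log file; the real system may store run records elsewhere.
LOG = Path("runs.jsonl")

def record_run(tag, fn):
    """Run fn(), appending wall time and a version tag for later comparison."""
    start = time.perf_counter()
    result = fn()
    entry = {"tag": tag,
             "wall_seconds": round(time.perf_counter() - start, 6),
             "timestamp": time.time()}
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return result

def compare_runs(tag_a, tag_b):
    """Return mean wall time per tag so two code versions can be compared."""
    runs = [json.loads(line) for line in LOG.read_text().splitlines()]
    def mean(tag):
        times = [r["wall_seconds"] for r in runs if r["tag"] == tag]
        return sum(times) / len(times)
    return {tag_a: mean(tag_a), tag_b: mean(tag_b)}

record_run("v1", lambda: sum(range(100000)))   # iteration 1 of an experiment
record_run("v2", lambda: sum(range(200000)))   # iteration 2, changed code
print(compare_runs("v1", "v2"))
```

A dashboard like the one described would then read the same log to plot trends across many tagged runs.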
&lt;p>Through this summer, we aim to develop a system that offers a streamlined interface, enabling users to effortlessly monitor their running programs using simple command-line commands. Moreover, our system will feature a user-friendly dashboard where researchers can access historical runtime information and visualize comparisons between different iterations. The dashboard will present comprehensive graphs and charts, facilitating the analysis of trends and patterns in the environmental data.&lt;/p></description></item><item><title>Interactive Exploration of High-dimensional Datasets with PolyPhy and Polyglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230616-kirandeol/</link><pubDate>Fri, 16 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230616-kirandeol/</guid><description>&lt;p>Hello! My name is Kiran and this summer I&amp;rsquo;ll be working with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy">Polyphy&lt;/a> and &lt;a href="https://normand-1024.github.io/Bio-inspired-Exploration-of-Language-Embedding/" target="_blank" rel="noopener">Polyglot&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>.
The full &lt;a href="https://drive.google.com/file/d/1iwKU938uzUHn0oY2tM0jPADOYoF0kqbh/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> is available online.&lt;/p>
&lt;p>For a brief overview, the Polyglot app allows users to interact with a 3D network of high-dimensional language embeddings, specifically the
&lt;a href="http://vectors.nlpl.eu/repository/" target="_blank" rel="noopener">Gensim Continuous Skipgram result of Wikipedia Dump of February 2017 (296630 words)&lt;/a> dataset. The high-dimensional
embeddings are reduced to 3 dimensions using UMAP. The novel &lt;a href="https://iopscience.iop.org/article/10.3847/2041-8213/ab700c/pdf" target="_blank" rel="noopener">MCPM slime mold metric&lt;/a> is then used
to compute the similarity levels between points (much like how you might compute the Euclidean distance between two points). These similarity levels are used
to filter the network and enable users to find interesting patterns in their data that they might not find using quantitative methods alone. For example, the network has
a distinct branch in which only years are nearby! Users might find other clusters, such as ones with sports words or even software engineering words.
Although such exploration may not lead to quantitatively significant conclusions, the ability to explore and test mini hypotheses about the data can lead to
important insights that later inform quantitatively significant conclusions.&lt;/p>
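The filtering step described above can be sketched in a few lines. Everything below is a toy stand-in: the points are random rather than UMAP-reduced embeddings, and the similarity formula is invented purely for illustration, since real similarity levels come from PolyPhy's MCPM simulation:

```python
import math
import random

# Toy 3D point cloud standing in for UMAP-reduced word embeddings.
random.seed(0)
points = [[random.gauss(0, 1) for _ in range(3)] for _ in range(6)]

def similarity(a, b):
    # Hypothetical stand-in for the MCPM metric: higher when points are closer.
    return 1.0 / (1.0 + math.dist(a, b))

def filter_edges(pts, threshold):
    """Keep only node pairs whose similarity level exceeds the threshold."""
    n = len(pts)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if similarity(pts[i], pts[j]) > threshold]

# Raising the threshold prunes the network down to the tightest clusters,
# which is how branches like the "years" cluster become visible.
print(len(filter_edges(points, 0.2)), len(filter_edges(points, 0.5)))
```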
&lt;p>In our project, we aim to expand Polyglot such that any user can upload their own data, once they have computed the MCPM metric using PolyPhy. This will have
important applications in building trust in our data and embeddings. This could also help with research on the MCPM metric, which presents a new, more naturalistic
way of computing similarity by relying on the principle of least effort. Overall, there is an exciting summer ahead and if you&amp;rsquo;re interested in keeping up please
feel free to check out the Polyglot app on GitHub!&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230616-charishulu/</link><pubDate>Fri, 16 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230616-charishulu/</guid><description>&lt;p>Hi! I&amp;rsquo;m Charis, an undergraduate student in the IT and Big Data Analytics program at the Calvin Institute of Technology. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1dFkC2A0HUVaWd6NpCbTjRZVfYxQ7jRxJ/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a> and &lt;strong>Martin Putra&lt;/strong> aims to gain insight into features that are highly correlated with execution times of genomics workflows and build machine learning models for predicting workflow execution time.&lt;/p>
&lt;p>Genomics workflows exhibit a long-tail pattern in their execution times. According to the previous project team&amp;rsquo;s findings, approximately 2% of genomics workflows had a median execution time of up to 15%, resulting in weeks of execution. Interestingly, it was observed that input quality plays a role in these execution time differences. Therefore, we will analyze features such as the quality of input data as well as the amount of resources allocated in the execution of genomics workflows to find features that correlate with execution time. Based on these features we will build a machine learning model that can predict the execution time of genomics workflows.&lt;/p>
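As a toy illustration of the modeling direction, here is a one-feature least-squares fit of runtime against an input-quality metric. All numbers are invented, the quality feature is hypothetical, and the real project will use many features and richer machine learning models rather than a single linear fit:

```python
# Made-up QC feature (fraction of low-quality reads) and made-up runtimes.
frac_low_quality = [0.01, 0.05, 0.10, 0.20, 0.30]
runtime_hours = [2.1, 2.9, 4.2, 6.8, 9.5]

# Ordinary least squares for a single predictor: slope = cov(x, y) / var(x).
n = len(frac_low_quality)
mean_x = sum(frac_low_quality) / n
mean_y = sum(runtime_hours) / n
cov = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(frac_low_quality, runtime_hours))
var = sum((x - mean_x) ** 2 for x in frac_low_quality)
slope = cov / var
intercept = mean_y - slope * mean_x

def predict(x):
    """Predicted workflow runtime (hours) for a given quality-feature value."""
    return intercept + slope * x

print(round(predict(0.15), 2))
```

A positive slope here would mirror the hypothesis that lower input quality lengthens execution time.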
&lt;p>By collaborating with Shayantan Banerjee (another contributor) who will study data quality, I will study the system metrics of genomics workflows both at workflow-level and tool-level. Metrics will be collected by running genomics workflows using the Slurm workload manager under various resource allocation conditions. Genomics workflows will be executed on Chameleon clusters of different sizes.&lt;/p></description></item><item><title>GPU Emulator for Easy Reproducibility of DNN Training</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230613-haoranwu/</link><pubDate>Tue, 13 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230613-haoranwu/</guid><description>&lt;p>Hi! I’m Haoran Wu, a third year at the University of Chicago majoring in Economics and Computer Science. With my &lt;a href="https://docs.google.com/document/d/1CcNbvbNAmY0XkV9ckjHnILdMh92h1wqLUYqpT6qIsZY/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, I’m working on the &lt;a href="https://ospo.ucsc.edu/project/osre23/utexas/gpuemulator" target="_blank" rel="noopener">GPU Emulator for Easy Reproducibility of DNN Training&lt;/a> project with Professor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vijay-chidambaram/">Vijay Chidambaram&lt;/a>. A Deep Neural Network (DNN) is an advanced artificial neural network that employs multiple layers to process intricate patterns and relationships within data. It finds applications in various fields such as image and speech recognition, natural language processing, and predictive modeling. The layers in a DNN progressively extract higher-level features from raw input data, enabling the network to learn and generalize patterns effectively.&lt;/p>
&lt;p>The growing popularity of Deep Neural Networks has resulted in a substantial increase in demand for Graphics Processing Units (GPUs). GPUs are crucial for conducting matrix computations in DNN training and inference. However, they are expensive to purchase for personal use, and the limited availability of GPU resources in public research clouds like Chameleon further exacerbates the issue. This scarcity of resources can cause delays in DNN-related research projects.&lt;/p>
&lt;p>Nevertheless, not all DNN research experiments require the use of a GPU. System researchers, for instance, may be primarily interested in performance profiles and not necessarily in the accuracy of training or inference. These researchers might focus on optimizing the storage layer and data loading of DNN training. In such cases, a GPU emulator that accurately replicates GPU behavior without needing a physical GPU can fulfill their requirements. By utilizing a GPU emulator, system researchers can evaluate their system optimizations&amp;rsquo; performance without competing for limited GPU resources in the cloud, thereby avoiding unnecessary delays in their research progress. Our work will eventually be open source and benefit the community.&lt;/p></description></item><item><title>Optimizing FasTensor: Enabling Efficient Tensor Execution on GPUs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/lbl/fastensor/20230605-ris0801/</link><pubDate>Mon, 05 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/lbl/fastensor/20230605-ris0801/</guid><description>&lt;p>Greetings,&lt;/p>
&lt;p>I am Rishabh Singh, and I am excited to be part of the 2023 Google Summer of Code program. My &lt;a href="https://docs.google.com/document/d/14DRkbF1S0VnPcopd37Io0pgKVQ1bDSN3QMf3Os6JyBA/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a> focuses on optimizing the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/fastensor">FasTensor&lt;/a> tensor computing library for efficient usage on GPUs, specifically targeting tensor contraction while preserving structure-locality. This optimization is crucial for scientific applications and advanced AI model training. Throughout the project, I will develop custom computational operations for GPUs, implement FasTensor on GPUs, assess its performance, and provide comprehensive documentation. By the end, I aim to deliver a working implementation, a performance report, and a detailed execution mechanism guide. Leveraging my background in software engineering and machine learning, I will use C++ with OpenMP to ensure efficient memory management and data movement. Stay tuned for regular updates and informative blogs as I progress through the summer.&lt;/p></description></item><item><title>Using Reproducibility in Machine Learning Education</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230605-kokoedwin/</link><pubDate>Mon, 05 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230605-kokoedwin/</guid><description>&lt;p>I am Jonathan Edwin, coming from Indonesia, and I am extremely thrilled to be involved in the 2023 Summer of Reproducibility initiative. 
I am contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project.&lt;/p>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1UEIKfZuPwJ88fMQ1-109vzpA7r4-7ehG/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> aims to develop educational resources focusing on reproducing and replicating fundamental machine-learning techniques, such as Cutout data augmentation, U-Net, and Siamese networks. It aims to provide students with a hands-on learning experience that enhances their understanding of the models and their underlying principles while imparting valuable skills in ensuring research reproducibility.
The project will involve the creation of a series of interactive Jupyter notebooks covering the selected papers, guiding students through reproducing results, and focusing on best practices for ensuring reproducibility. Upon completion, the notebooks will provide a comprehensive and accessible learning experience for students while emphasizing the importance of reproducibility in machine learning education.
The proposal also identifies potential challenges associated with the project and proposed solutions to address them. Challenges include incompatibility issues with the original code and current frameworks or environments, difficulty in reproducing the exact results due to factors such as randomness or lack of specific details in the paper, and ensuring that the interactive elements in the Jupyter Notebooks are engaging and effective in teaching reproducibility concepts.&lt;/p></description></item><item><title>FlashNet: Towards Reproducible Continual Learning for Storage System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230604-rannnayy/</link><pubDate>Sun, 04 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230604-rannnayy/</guid><description>&lt;p>Hello! I&amp;rsquo;m Rani, a third year undergraduate student at Institut Teknologi Bandung majoring in Informatics. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet">FlashNet&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1EhJm3kqrpybOkpXiiRMfqVxGeKe9iIsh/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a> and &lt;strong>Daniar Kurniawan&lt;/strong> aims to implement and optimize the FlashNet model in real-world storage systems using continual learning techniques.&lt;/p>
&lt;p>In real-world workloads, the I/O stream is known to change over time. As a result, I/O read/write performance varies and introduces tail latency. We would like to predict the latency of each I/O read so we can cut the tail and improve the system&amp;rsquo;s performance. This project focuses on improving the FlashNet pipeline and introducing adaptability to the machine learning models we build.&lt;/p>
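To illustrate what "the tail" means here, the sketch below draws synthetic read latencies and flags those beyond the 95th percentile; the distribution, threshold choice, and numbers are all made up for illustration and are not FlashNet's actual data or interface:

```python
import random

# Synthetic read latencies (microseconds) from a skewed distribution,
# standing in for a real I/O trace.
random.seed(42)
latencies_us = [random.expovariate(1 / 200) for _ in range(1000)]

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[idx]

# Reads beyond p95 form the tail a latency predictor would try to catch,
# e.g. so they could be redirected or failed over early.
p95 = percentile(latencies_us, 95)
slow = [x for x in latencies_us if x > p95]
print(f"p95 = {p95:.0f}us, {len(slow)} of {len(latencies_us)} reads in the tail")
```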
&lt;p>During the summer, we plan to implement the continual learning pipeline using the machine learning models we have built previously in the project. Of course, continual learning isn&amp;rsquo;t truly continual without the ability to retrain itself. Thus, we will implement, evaluate, and test several drift detection algorithms. We will also build a visualization platform to evaluate and monitor the performance of the models built. Lastly, we plan to create Chameleon Trovi artifacts to demonstrate our experiments and make these implementations available and reproducible to the public.&lt;/p></description></item><item><title>Introducing Levels of Reproduction and Replication in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230601-msaeed/</link><pubDate>Thu, 01 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230601-msaeed/</guid><description>&lt;p>Greetings everyone,&lt;/p>
&lt;p>I am Mohamed Saeed and I am delighted to be part of the 2023 Summer of Reproducibility program, where I am contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project.&lt;/p>
&lt;p>My &lt;a href="https://drive.google.com/file/d/13HnCMZawpabiLdBoOiaJFF2mNXIPLCVJ/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> was accepted, and I am fortunate to have &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> as my mentor. The objective of my project is to develop highly interactive open educational resources that can be utilized by instructors teaching graduate or undergraduate machine learning courses. These resources will focus on integrating instruction on reproducibility and reproducible research principles.&lt;/p>
&lt;p>Understanding and practicing reproducibility in machine learning (ML) research is of utmost importance in today&amp;rsquo;s scientific and technological landscape. Reproducibility ensures the reliability, transparency, and credibility of ML findings and discoveries. By learning the principles of reproducibility, students at different levels can validate research results, test introduced methodologies, and understand the level of reproducibility of research.&lt;/p>
&lt;p>My contribution will involve developing interactive educational resources that encompass code examples, writing exercises, and comprehensive explanations of key concepts of reproducing ML research. These resources will be carefully crafted to assist students at various levels of expertise. Our aim is for these resources to be widely adopted by instructors teaching graduate or undergraduate machine learning courses, as they seek to enhance the understanding of reproducibility and reproducible research principles.&lt;/p>
&lt;p>I think this is a great opportunity to learn more about ML research reproducibility. I&amp;rsquo;ll be posting regular updates and informative blogs throughout the summer, so stay tuned!&lt;/p></description></item><item><title>ScaleBugs: Reproducible Scalability Bugs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230601-boluwarinayinmode/</link><pubDate>Thu, 01 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230601-boluwarinayinmode/</guid><description>&lt;p>Hello! As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucdavis/scalebugs/">ScaleBugs&lt;/a> project our proposals (&lt;a href="https://drive.google.com/file/d/17iANa5ei_gguZsGGwR1sfPHOoJysnNsf/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> from &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/goodness-ayinmode/">Goodness Ayinmode&lt;/a> and &lt;a href="https://drive.google.com/file/d/199ZsiWHXsLYbSJ896vaf8tjrYs23P5xN/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> from &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahra-nabila-maharani/">Zahra Nabila Maharani&lt;/a>) under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/cindy-rubio-gonzalez/">Cindy Rubio González&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/hao-nan-zhu/">Hao-Nan Zhu&lt;/a> aim to build a dataset of reproducible scalability bugs by analyzing bug reports from popular distributed systems like Cassandra, HDFS, Ignite, and Kafka. 
For each bug report, we will analyze whether the reported bug is influenced by the scale of the operation, such as the number of nodes being used or the number of requests. The resulting dataset will consist of bug artifacts containing the buggy and fixed versions of the scalability system, a reproducible runtime environment, and workload shell scripts designed to demonstrate bug symptoms under different scales. These resources will help support research and development efforts in addressing scalability issues and optimizing system performance.&lt;/p></description></item><item><title>Reproducible Evaluation of Multi-level Erasure Coding</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ornl/multilevelerasure/20230531-zhiyanw/</link><pubDate>Wed, 31 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ornl/multilevelerasure/20230531-zhiyanw/</guid><description>&lt;p>Hi! My name is Alex, an undergraduate student at the University of Chicago. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/MultiLevelErasure">Reproducible Evaluation of Multi-level Erasure Coding&lt;/a> project, my &lt;a href="https://docs.google.com/document/d/1dO1aING1QcSB---XklzUjNz0usVh7qWffVGC3GZq2AE/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a> aims to build a platform to reproducibly evaluate the performance and durability of MLEC (Multi-Level Erasure Coding) for large-scale storage systems under different design configurations.&lt;/p>
&lt;p>To provide some context, Erasure Coding (EC) is a common approach to protect data from disk failures. Data centers nowadays increasingly use Multi-Level Erasure Coding (MLEC), a newly developed erasure coding method that aims to deal with the drawbacks of Single-Level Erasure Coding (SLEC). Despite its increasing popularity, there have not been many systematic studies to analyze and evaluate MLEC, which is the focus of this project.&lt;/p>
&lt;p>The evaluation will primarily be conducted through simulations, since modifying configurations in a real large-scale system is costly and impractical. The expected deliverables of this project will be:&lt;/p>
&lt;ul>
&lt;li>An MLEC simulator that can reproducibly simulate different configurations of the MLEC system, e.g. coding parameter selection, chunk placement scheme, repair method choice, etc.&lt;/li>
&lt;li>An analysis of the performance and durability tradeoffs between different MLEC design choices based on the evaluation results from the simulation&lt;/li>
&lt;li>Reproduced SLEC evaluation results using existing SLEC simulators&lt;/li>
&lt;li>A comparison between MLEC and SLEC on performance and durability tradeoffs&lt;/li>
&lt;li>Well-written documents and detailed guides on how to reproduce the evaluation results&lt;/li>
&lt;/ul>
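As a minimal sketch of the kind of durability estimate the simulator deliverable describes (all parameters invented; a real MLEC simulator must also model both coding levels, chunk placement, and repair rather than a single-level yearly snapshot):

```python
import random

random.seed(1)

def data_loss_probability(n_disks, parity, afr, trials):
    """Monte Carlo estimate: fraction of simulated years in which more
    disks fail than a single-level code's parity can cover.
    afr is a made-up annual failure rate per disk."""
    losses = 0
    for _ in range(trials):
        # Each disk independently fails within the year with probability afr.
        failures = sum(1 for _ in range(n_disks) if afr > random.random())
        if failures > parity:
            losses += 1
    return losses / trials

# Toy 8+2 disk-level stripe with a hypothetical 2% annual failure rate:
# data loss requires 3 or more of the 10 disks failing in the same year.
print(data_loss_probability(10, 2, 0.02, 100000))
```

Sweeping `parity` or `afr` in such a loop is the simplest version of comparing erasure-coding configurations reproducibly.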
&lt;p>Our plan is to build the simulator throughout the summer. We hope our simulator and evaluation results can provide designers of large-scale storage systems with valuable insights on choosing the most appropriate erasure coding configuration per their needs.&lt;/p></description></item><item><title>[FLASHNET]: Leveraging ML-augmented I/O in Linux</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230530-justin08784/</link><pubDate>Tue, 30 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230530-justin08784/</guid><description>&lt;p>Hi! I&amp;rsquo;m Justin, an undergraduate at the University of Chicago. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet">Flashnet&lt;/a> project my &lt;a href="https://drive.google.com/file/d/1gsNaYUYOgdN2ilpyPOmI7jjLeoZh219J/view" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of
&lt;strong>Daniar Kurniawan&lt;/strong> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a> aims to port the Flashnet model into the Linux kernel.&lt;/p>
&lt;p>In this attempt, I will borrow architecture/design choices from LAKE (to take advantage of its integration of ML-focused hardware acceleration in the kernel) and evaluation criteria from LinnOS to test for model inference accuracy. I also plan to support latency &amp;ldquo;bucket&amp;rdquo; inference output to improve accuracy. Ultimately, my goal is to gain further insight into best practices for integrating ML models into real-life operating systems like Linux and to inform general design choices for the Flashnet pipeline.&lt;/p></description></item><item><title>Intro: Open Source Autonomous Vehicle Controller</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230530-25chilingh/</link><pubDate>Tue, 30 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230530-25chilingh/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc">Open Source Autonomous Vehicle Controller Project&lt;/a>, my &lt;a href="https://docs.google.com/document/d/1hDU87aAzbn88vWwOHH0ggIID2W4KKzp8SKF1Lb8LU90/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Aaron Hunter and Carlos Espinosa&lt;/strong> aims to create comprehensive technical documentation to help onboard new users of the OSAVC controller. I will be writing tutorials and examples to demonstrate how to start with an OSAVC, programming it with the robotic equivalent of HelloWorld and later moving on to more sophisticated explanations. 
Hence, this will encourage more applications and wider adoption in the field of autonomous vehicles and expand the community of OSAVC users.&lt;/p></description></item><item><title>Reproduce and benchmark self-adaptive edge applications under dynamic resource management</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/edgebench/20230530-zharfanf/</link><pubDate>Tue, 30 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/edgebench/20230530-zharfanf/</guid><description>&lt;p>Hello there!&lt;/p>
&lt;p>I am Faishal Zharfan, a senior-year student studying Telecommunication Engineering at Bandung Institute of Technology (ITB) in Bandung, Indonesia; you can read my &lt;a href="https://drive.google.com/file/d/1u3UsCQZ40erpPmyoyn8DEVqH5Txmvvkz/view?usp=drive_link" target="_blank" rel="noopener">proposal&lt;/a> online. I&amp;rsquo;m currently part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/edgebench/">Edgebench&lt;/a> project under the mentorship of Yuyang Huang. The main goal of this project is to reproduce and benchmark self-adaptive video applications using the proposed solution.&lt;/p>
&lt;p>The topic I&amp;rsquo;m currently working on, &amp;ldquo;Reproduce and benchmark self-adaptive edge applications under dynamic resource management&amp;rdquo; (also known as Edgebench), is led by Prof. Junchen Jiang and Yuyang Huang. Edgebench is a project that focuses on efficiently distributing resources (bandwidth and CPU) across several video applications. Today&amp;rsquo;s video applications typically process their data on an edge server; in this edge-computing setting, bandwidth and compute over the WAN are the greatest concern, because both are strictly limited. We could distribute the bandwidth evenly across the cameras, but each camera&amp;rsquo;s bandwidth and compute needs differ. We therefore need another solution. The recently proposed solution is called the &amp;ldquo;accuracy gradient&amp;rdquo;: it tells us how much bandwidth a given application needs at a certain time to achieve higher accuracy. The goal is to allocate more bandwidth to the applications with the largest F1-score improvement and take it from those whose F1 score would not diminish significantly, so that the total F1 score is higher in the end.&lt;/p>
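&lt;p>To make the idea concrete, here is a minimal, hypothetical sketch of greedy allocation driven by the accuracy gradient; the per-camera accuracy profiles below are made-up illustrative numbers, not Edgebench measurements:&lt;/p>

```python
# Greedy "accuracy gradient" allocation sketch: repeatedly give the next unit
# of bandwidth to the camera whose F1 score would improve the most.

def allocate_bandwidth(profiles, total_units):
    """profiles maps camera -> [F1 at 0 units, F1 at 1 unit, ...] (non-decreasing)."""
    alloc = {cam: 0 for cam in profiles}
    for _ in range(total_units):
        # Marginal F1 gain (the "accuracy gradient") of one more unit, per camera.
        gains = {
            cam: curve[alloc[cam] + 1] - curve[alloc[cam]]
            for cam, curve in profiles.items()
            if alloc[cam] + 1 < len(curve)
        }
        if not gains:
            break
        alloc[max(gains, key=gains.get)] += 1
    return alloc

profiles = {
    "cam_busy":  [0.20, 0.55, 0.75, 0.85, 0.88],  # gains a lot from extra bandwidth
    "cam_quiet": [0.60, 0.65, 0.68, 0.69, 0.70],  # already near saturation
}
print(allocate_bandwidth(profiles, 4))
```

&lt;p>With these invented profiles, an equal split would give each camera two units (total F1 of 0.75 + 0.68 = 1.43), while the gradient-driven split gives the busy camera three units and the quiet one a single unit (0.85 + 0.65 = 1.50), a higher total.&lt;/p>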
&lt;p>Throughout this summer, we have planned to implement the &amp;ldquo;accuracy gradient&amp;rdquo; and test several baselines to compare against the solution. As for the implementation, we are currently working on latency measurement: we are aware that this solution introduces overhead, so latency must be taken into account.&lt;/p></description></item><item><title>Enhancing and Validating LiveHD's Power Modeling Flow</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/livehd/20230529-shahzaibk23/</link><pubDate>Mon, 29 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/livehd/20230529-shahzaibk23/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd">Enhancing and Validating LiveHD&amp;rsquo;s Power Modeling Flow&lt;/a> my &lt;a href="https://docs.google.com/document/d/1_GtzWf_gCKkreN1-6VSAI4h2BqwKEUDGkNNB1OM554I/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of Jose Renau and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a> aims to enhance and validate LiveHD&amp;rsquo;s power modeling flow, a critical feature for estimating power consumption in modern hardware designs. The existing flow requires further refinement to ensure its stability, accuracy, compatibility with a wider range of netlists and VCD files, and overall performance. To address these challenges, the project will focus on methodically debugging the current implementation, establishing a comprehensive validation methodology for verifying the accuracy of power estimates, and optimizing the flow to handle larger netlists and VCD files efficiently. Additionally, the project aims to improve existing documentation by providing detailed explanations, examples, and tutorials to facilitate user adoption and understanding.
Upon successful completion, the project will deliver a more reliable, accurate, and efficient power modeling flow within LiveHD, contributing to the development of energy-efficient hardware designs. This refined flow will not only enhance the capabilities of LiveHD but also encourage wider adoption and utilization by the hardware design community, fostering innovation in the field of energy-efficient devices and systems.&lt;/p></description></item><item><title>High Fidelity UAV Simulation Using Unreal Engine with specular reflections</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230601-damodardatta/</link><pubDate>Mon, 29 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230601-damodardatta/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc">Open Source Autonomous Vehicle Controller&lt;/a> my &lt;a href="https://drive.google.com/file/d/18g-WRZj_7ufIt6YZNn4OG1s7VKi1u5hV/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Aaron Hunter and Carlos Espinosa&lt;/strong> aims to develop an Unreal Engine based simulator for testing. The simulator will use Unreal Engine for physics and visualization.&lt;/p>
&lt;p>The existing framework uses the Gazebo simulator with ROS, which limits development to the Python and C++ programming languages. I intend to develop this simulator so that it connects with Python and C++, and additionally to expand support to MATLAB so that future control algorithm design and validation become easier. To smooth future development, I intend to add detailed documentation consisting of weekly reports from the development period, examples, and tutorials. Upon successful completion, the project will deliver a powerful simulator with realistic Unreal Engine based simulation and support for additional programming languages such as MATLAB.&lt;/p>
&lt;p>For more information about the Open Source Autonomous Vehicle Controller and the UC OSPO organization, you can visit the &lt;a href="https://github.com/uccross/open-source-autonomous-vehicle-controller" target="_blank" rel="noopener">OSAVC project repository&lt;/a> and the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/">UC OSPO website.&lt;/a>&lt;/p></description></item><item><title>OpenRAM Layout versus Schematic (LVS) visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/openram/20230529-mahnoor-ismail01/</link><pubDate>Mon, 29 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/openram/20230529-mahnoor-ismail01/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/openram">OpenRAM Layout versus Schematic (LVS) visualization&lt;/a> my &lt;a href="https://docs.google.com/document/d/1QEBOglVgy20s0v1_vfpFHw8CdIYUbex12TOjSlAe1-E/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jesse-cirimelli-low/">Jesse Cirimelli-Low&lt;/a> and &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a> aims to develop a comprehensive Python-based graphical user interface (GUI) with a robust backend system to effectively analyze, visualize, and debug layout versus schematic (LVS) mismatches in the OpenRAM framework. The proposed solution focuses on efficiently processing LVS report files in JSON format, identifying mismatched nets in the layout, and visually representing extra nets in the schematic graph using advanced backend algorithms. By implementing a powerful backend system, the GUI will streamline the debugging process and improve overall productivity, while maintaining high performance and reliability.
The deliverables for this project include a fully-functional GUI with a performant backend, features for visualizing and navigating through LVS mismatches, comprehensive documentation, and user guides.&lt;/p></description></item><item><title>Automatic Cluster Performance Shifts Detection Toolkit</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230527-kangrui/</link><pubDate>Sat, 27 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230527-kangrui/</guid><description>&lt;p>Hi! I am Kangrui, a Pre-doc student at the University of Chicago. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/anl/perfdrift">Automatic Cluster Performance Shifts Detection Toolkit&lt;/a> my &lt;a href="https://drive.google.com/file/d/1AxpgWLzF3oKTFlD8q6JYS35CxxJ6c76X/view?usp=share_link" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Sandeep Madireddy&lt;/strong> and &lt;strong>Ray Andrew&lt;/strong> aims to design a real-time performance shift detection algorithm for high-performance computing clusters, ensuring minimal overheads.&lt;/p>
&lt;p>This project focuses on developing a real-time performance shift detection algorithm tailored to heterogeneous workloads, aiming to promptly inform administrators about performance changes. The primary goal is to design an algorithm that efficiently detects shifts in real-time, with minimal system overheads.&lt;/p>
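&lt;p>As a rough illustration of the kind of low-overhead online detector we have in mind, here is a minimal one-sided CUSUM sketch; the throughput numbers and thresholds are invented for illustration and are not Darshan data:&lt;/p>

```python
# One-sided CUSUM sketch: accumulate evidence that a performance metric has
# drifted below its baseline mean, and flag a shift once the accumulated
# evidence exceeds a threshold. O(1) work per sample keeps overhead minimal.

def detect_downward_shift(samples, baseline_mean, slack=0.5, threshold=3.0):
    """Return the index at which a downward shift is flagged, or None."""
    cusum = 0.0
    for i, x in enumerate(samples):
        # Deviations below baseline (beyond the slack) accumulate; the
        # statistic resets toward zero while the metric looks healthy.
        cusum = max(0.0, cusum + (baseline_mean - x) - slack)
        if cusum > threshold:
            return i
    return None

# Throughput (GB/s) holds near 10, then degrades to ~7 starting at index 5.
samples = [10.1, 9.8, 10.0, 9.9, 10.2, 7.1, 7.0, 6.9, 7.2, 7.0]
print(detect_downward_shift(samples, baseline_mean=10.0))
```

&lt;p>A real detector would also estimate the baseline online and filter noise, which is part of the design space of shift detection methods and noise filtering techniques we plan to evaluate.&lt;/p>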
&lt;p>In addition to algorithm development, we plan to enhance the Darshan toolkit&amp;rsquo;s functionality by integrating our algorithm, offering users early performance shift detection. This integration will aid administrators in making informed system utilization and scheduling decisions.&lt;/p>
&lt;p>To promote transparency and reproducibility, we&amp;rsquo;ll encapsulate our findings, scripts, and profiling data within a Jupyter notebook shared via Chameleon Trovi, enabling other researchers to reproduce our experiments easily.&lt;/p>
&lt;p>Looking ahead, we plan to expand the algorithm&amp;rsquo;s applicability to cater to diverse HPC workloads and infrastructures. Other areas of interest include its use in detecting shifts in financial markets or monitoring IoT data streams. Further refinement of our algorithm, to reduce overheads and improve real-time detection capabilities, is also a part of our future endeavours. This task may involve evaluating various shift detection methods and noise filtering techniques.&lt;/p></description></item><item><title>Using Reproducibility in Machine Learning Education: Reproducibility with Incomplete Methodology Descriptions</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230527-indianspeedster/</link><pubDate>Sat, 27 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230527-indianspeedster/</guid><description>&lt;p>Hey,&lt;/p>
&lt;p>I am Shekhar and I am one of several students who are working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>. My &lt;a href="https://drive.google.com/file/d/1rCzLGIJ8HYCVjY_MfndgrQjAQa2SQbqZ/view?usp=sharing" target="_blank" rel="noopener">Proposal&lt;/a> aims to develop interactive educational materials about reproducibility in machine learning, for use in graduate and undergraduate classes. My project is inspired by my experience in the &lt;a href="https://paperswithcode.com/rc2022" target="_blank" rel="noopener">Machine Learning Reproducibility Challenge&lt;/a>, where I found that a major challenge for reproducibility was that some details were left ambiguous in the paper I was trying to reproduce. For my project, I will develop an interactive tutorial to help demonstrate how if the methodology details are not fully specified in a publication, then someone trying to reproduce the result will have to make choices that may not match the authors’, and these choices will affect whether or not the final result is validated.&lt;/p></description></item><item><title>Efficient Communication with Key/Value Storage Devices</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank/</link><pubDate>Fri, 26 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank/</guid><description>&lt;p>Hi everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Manank Patel, currently an undergraduate student at Birla Institute of Technology and Sciences - Pilani, KK Birla Goa Campus. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore">Efficient Communication with Key/Value Storage Devices&lt;/a> my &lt;a href="https://drive.google.com/file/d/1iJIlHuCpnvDeOyr5DphDDimqdl9s4hKH/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/aldrin-montana/">Aldrin Montana&lt;/a> and &lt;strong>Philip Kufeldt&lt;/strong> aims to implement an io_uring-based communication backend for a network-based key-value store.&lt;/p>
&lt;p>The KV store clients currently use traditional network sockets and POSIX APIs to communicate with the KV store. A notable advancement of the past two years is io_uring, a new kernel interface that can be used instead of the POSIX API. It employs shared-memory queues for communication between the kernel and user space, enabling data transfer without per-operation system calls and supporting zero-copy transmission. By circumventing the overhead associated with system calls, this approach has the potential to enhance performance significantly.&lt;/p></description></item><item><title>Update OpenROAD Documentation and Tutorials</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230526-luarss/</link><pubDate>Fri, 26 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230526-luarss/</guid><description>&lt;p>Hi! I am Jack, a Master&amp;rsquo;s student at the National University of Singapore. In GSoC 2023, I will be undertaking the project entitled &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/openroad">Update OpenROAD Documentation and Tutorials&lt;/a> to improve the user experience and documentation of this exciting open-source RTL-to-GDSII framework, jointly mentored by &lt;strong>Indira Iyer Almeida&lt;/strong> and &lt;strong>Vitor Bandeira&lt;/strong>. Check out my proposal &lt;a href="https://drive.google.com/file/d/1_R4zDe2N05LtAsvDKa3w6C98GvIZ8HAI/view?usp=sharing" target="_blank" rel="noopener">here!&lt;/a>&lt;/p>
&lt;p>This project aims to review OpenROAD-flow-scripts and add missing documentation and tutorials. A key focus will be on increasing ease of setup by updating documentation, setup scripts, and Docker-based commands. Next, we will also update documentation for the following OpenROAD components: Makefile flow variables, distributed detailed routing, Hier-RTLMP, and AutoTuner. If time permits, cloud enablement will be implemented, alongside notebook-based packaging to further increase ease of adoption.&lt;/p></description></item><item><title>Advancing Reproducible Science through Open Source Laboratory Protocols as Software</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/labop/20230621-luhesketh/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/labop/20230621-luhesketh/</guid><description>&lt;p>Hello everyone!&lt;/p>
&lt;p>My name is Luiza, and I am an eighth-semester BSc Biological Sciences student from São Paulo, Brazil. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/labop">LabOP&lt;/a> working group, my &lt;a href="https://docs.google.com/document/d/1pJ7UIATZYASXjbLdUosvq08QkhPNTFxZFId9dapNp-o/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dan-bryce/">Dan Bryce&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tim-fallon/">Tim Fallon&lt;/a> aims to build a converter that takes ordinary laboratory protocols and translates them into machine-executable protocols. This is possible thanks to LabOP&amp;rsquo;s versatile representation of what a laboratory protocol should look like. I&amp;rsquo;ll be testing this specialization on Hamilton machines, which are great for scaling up experiments.&lt;/p>
&lt;p>Today, biotechnology laboratories face a very common issue: protocols are difficult to share and to adapt for machine execution. Laboratory protocols are critical to biological research and development, yet complicated to communicate and reproduce across projects, investigators, and organizations. While many attempts have been made to address this challenge, there is currently no available protocol representation that is unambiguous enough for precise interpretation and automation, yet simultaneously abstract enough to enable reuse and adaptation.&lt;/p>
&lt;p>With LabOP, we can convert a protocol in multiple ways depending on the researcher&amp;rsquo;s needs for automation or human execution, allowing flexibility in how experiments are run. I&amp;rsquo;ll be building a specialization that translates protocols so they can be executed by Hamilton machines.&lt;/p></description></item><item><title>Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230526-ren.450/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230526-ren.450/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings">Measuring Research Prototypes under Unreported Settings&lt;/a> my &lt;a href="https://drive.google.com/file/d/1ouFre-qMDCL_LiH5jFNUCOI1yAYHdWcS/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/miao-yu/">Miao YU&lt;/a> aims to understand the impact of missing settings in artifact evaluation.&lt;/p>
&lt;p>The project plans to measure the impact of different missing settings for open-source database systems such as MySQL and PostgreSQL, particularly under the TPC-C benchmark. The objective requires running experiments with popular settings that are not reported and fixing any problems that arise for the target systems. The project will compare the performance characteristics and analyze the impact of missing settings on the performance of the target systems.&lt;/p></description></item><item><title>PolyPhy Infrastructure Enhancement</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230525-prashantjha/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230525-prashantjha/</guid><description>&lt;p>Hey!&lt;/p>
&lt;p>I&amp;rsquo;m Prashant Jha from Pune, a recent graduate of BITS Pilani. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy">PolyPhy&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1y2X1_6_HliYowZn-qHd7x_Hz6QC3-KSe/view" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a> aims to develop and improve the project&amp;rsquo;s current infrastructure.&lt;/p>
&lt;p>Polyphorm / PolyPhy is led by Oskar Elek. PolyPhy is an organization that focuses on developing a GPU-oriented
agent-based system for reconstructing and visualizing optimal transport networks
defined over sparse data. With its roots in astronomy and inspiration drawn from nature,
PolyPhy has been instrumental in discovering network-like patterns in natural language
data and reconstructing the Cosmic web structure using its early prototype called
Polyphorm. The organization aims to provide a richer 2D / 3D scalar field representation
of the reconstructed network, making it a toolkit for a range of specialists across
different disciplines, including astronomers, neuroscientists, data scientists, and artists.
PolyPhy&amp;rsquo;s ultimate purpose is to create quantitatively comparable structural analytics
and discover connections between different disciplines. To achieve its goals, PolyPhy
requires a robust infrastructure that is engineered using DevOps, Code Refactoring, and
Continuous Integration/Continuous Deployment (CI/CD) practices.
You can see an instructive overview of PolyPhy in our workshop and more details about our research &lt;a href="https://polyphy.io/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p></description></item><item><title>Strengthening Underserved Segments of the Open Source Pipeline</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/sus/20230524-nandinisaagar/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/sus/20230524-nandinisaagar/</guid><description>&lt;p>Namaste everyone🙏🏻!&lt;/p>
&lt;p>I&amp;rsquo;m Nandini Saagar from Mumbai, an undergraduate student at the Indian Institute of Technology (BHU) Varanasi, Banaras Hindu University. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/sus">Strengthening Underserved Segments of the Open Source Pipeline&lt;/a> my &lt;a href="https://docs.google.com/document/d/1snzaUfBvptLcWP7I8IyKYFuBNfVGxNe9mnYkFXhb5ZM/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/emily-lovell/">Emily Lovell&lt;/a> aims to strengthen underserved segments of the open source pipeline.&lt;/p>
&lt;p>My interest in Open Source was first piqued as a freshman, when I was introduced to Open Source as a place where people from all communities and backgrounds come together to create software with real-world impact, all in a completely autonomous and self-governed manner! I am so glad that I could go from imagining Open Source as a distant dream to being a part of multiple such communities. This journey has been life-defining for me, and that’s why I want to help deliver the message of Open Source to all teenagers!&lt;/p>
&lt;p>This project seeks to invite and support broader, more diverse participation in open source by supporting early contributors, especially those who have been historically minoritized within tech. It aims to create content that anyone with some Open Source experience can use to guide new students through the journey of Open Source, GitHub, and the relevant technologies; to provide a platform for contributors to share their Open Source experiences and testimonials; to conduct an Open Source themed hackathon/scavenger hunt; and to leverage social media engagement to acquaint young, brilliant minds with the technical and open-source world at an early age.&lt;/p>
&lt;p>Stay tuned to explore the enormous world of Open Source with me!&lt;/p></description></item><item><title>Open Source Autonomous Vehicle Controller</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230525-aniruddha1261/</link><pubDate>Wed, 24 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230525-aniruddha1261/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc">Open Source Autonomous Vehicle Controller Project&lt;/a> my &lt;a href="https://drive.google.com/file/d/1_w9RfOM6XWruYUDR1d1yo45tQenpTQq5/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Aaron Hunter and Carlos Espinosa&lt;/strong> aims to develop a tutorial that serves as a comprehensive guide for new users of the OSAVC controller. The tutorial will start from scratch, demonstrating how to initialize and program the controller using the equivalent of a &amp;ldquo;Hello, World!&amp;rdquo; program. Subsequently, it will progress to more advanced applications.&lt;/p>
&lt;p>Throughout the project, I will work closely with my mentors to ensure the accuracy, clarity, and usability of the documentation. Their guidance and expertise will be instrumental in achieving the project&amp;rsquo;s objectives effectively.&lt;/p>
&lt;p>By creating comprehensive technical documentation, this project aims to empower new users to harness the capabilities of the OSAVC controller. It will facilitate their understanding of the controller&amp;rsquo;s functionalities and enable them to leverage its potential in the field of autonomous vehicle applications.&lt;/p>
&lt;p>I am excited to embark on this journey, contribute to the open-source community, and make a valuable impact in the field of autonomous vehicles. Stay tuned for regular updates and progress reports as I work towards achieving the goals set forth in this project.&lt;/p>
&lt;p>For more information about the Open Source Autonomous Vehicle Controller and the UC OSPO organization, you can visit the &lt;a href="https://github.com/uccross/open-source-autonomous-vehicle-controller" target="_blank" rel="noopener">OSAVC project repository&lt;/a> and the &lt;a href="https://ospo.ucsc.edu/" target="_blank" rel="noopener">UC OSPO website.&lt;/a>&lt;/p>
&lt;p>Stay connected and join me in this exciting endeavor!&lt;/p></description></item><item><title>Verify the reproducibility of an experiment</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230524-jesselima/</link><pubDate>Wed, 24 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230524-jesselima/</guid><description>&lt;p>Hello everyone,
my name is Jesse, and I&amp;rsquo;m proud to be a fellow in the 2023 Summer of Reproducibility program, contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow">noWorkflow&lt;/a> project.&lt;/p>
&lt;p>My &lt;a href="https://docs.google.com/document/d/1YMtPjZXcgt5eplyxIgQE8IBpQIiRlB9eqVSQiIPhXNU/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> was accepted under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a> and aims to
map and test the capture of provenance in typical Data Science and Machine Learning experiments.&lt;/p>
&lt;h4 id="what">What&amp;hellip;&lt;/h4>
&lt;p>Although much can be said about what reproducibility means, the ability to replicate results in day-to-day Data Science and Machine Learning experiments can pose a significant challenge for individuals, companies, and research centers. This challenge becomes even more pronounced with the rise of analytics and AI, where scientific methodologies are applied extensively on an industrial scale. Reproducibility thus assumes a key role in the productivity and accountability expected from Data Scientists, Machine Learning Engineers, and other roles engaged in ML/AI projects.&lt;/p>
&lt;h4 id="how">How&amp;hellip;&lt;/h4>
&lt;p>In day-to-day work, the pitfalls of non-reproducibility appear at different points of the experiment lifecycle. These challenges arise whenever an individual or a team of scientists must manage multiple experiments. In a typical experiment workflow, reproducibility concerns appear at several steps of the process:&lt;/p>
&lt;ul>
&lt;li>The need to track the provenance of datasets.&lt;/li>
&lt;li>The need to manage changes in hypothesis tests.&lt;/li>
&lt;li>Addressing the management of system hardware and OS setups.&lt;/li>
&lt;li>Dealing with outputs from multiple experiments, including the results of various model trials.&lt;/li>
&lt;/ul>
&lt;p>In academic environments, these issues can result in mistakes and inaccuracies. In companies, they can lead to inefficiencies and technical debts that are difficult to address in the future.&lt;/p>
&lt;h4 id="finally">Finally&amp;hellip;&lt;/h4>
&lt;p>I believe this is a great opportunity to explore the convergence of these two hot topics, AI and reproducibility! I will share more updates here throughout this summer, and I hope we can learn a lot together!&lt;/p></description></item><item><title>Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230524-srishti-j18/</link><pubDate>Tue, 23 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230524-srishti-j18/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet">Teaching Computer Networks with Reproducible Research project&lt;/a> my &lt;a href="https://drive.google.com/file/d/1EI0Zhh6YFwufEZ-53VWwhTOyJUuw7-Rf/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> aims to develop a classroom competition for adaptive video delivery policies, leveraging an existing open-source reproducible result. The competition will challenge students to extend the original work and design their own adaptive policies for head-to-head competition against their classmates. The project will involve packaging the existing result for easy reproducibility and building on it by implementing other adaptive video policies from the literature, developing different network settings for evaluating student submissions, and creating an evaluation framework for scoring submissions based on various criteria (so that competition remains fair and unbiased). 
The deliverables include a functional submission and evaluation process, an evaluation framework, and documentation and materials for course instructors to use in the classroom.&lt;/p></description></item><item><title>OSRE Catalyst</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/catalyst/</link><pubDate>Thu, 23 Mar 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/catalyst/</guid><description>&lt;p>Contributing to an open source project is a great way to build a technical portfolio, learn industry tools/practices, and have real-world impact – all while embedded in a collaborative community. The UC Santa Cruz Open Source Program Office (OSPO) wants to support more students on this path, especially those who have been minoritized in tech. We are partnering with an HBCU for a pilot summer program offering, with hopes to expand our reach in 2024.&lt;/p>
&lt;p>Through a hybrid (in-person/remote) model, participating students will spend four weeks on the UCSC campus learning about open source, followed by four weeks remotely contributing to an open source project. Participants will be well-supported by our instructional team, as well as their small peer cohort, through community-building and mentorship spanning the full eight weeks.&lt;/p>
&lt;h3 id="pilot-program-mentor--developer">Pilot Program Mentor &amp;amp; Developer&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Education&lt;/code>, &lt;code>Broadening Participation&lt;/code>, &lt;code>Mentorship and Support&lt;/code>, &lt;code>Community&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> communication, organization, GitHub/Markdown, basic web programming (HTML, CSS, JavaScript), open source contribution, version control/git workflow, mentorship, teaching&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Novice to Intermediate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/emily-lovell/">Emily Lovell&lt;/a>, &lt;a href="mailto:davis@soe.ucsc.edu">James Davis&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Given that this is a program pilot, your involvement and feedback will directly help shape its future!&lt;/p>
&lt;p>Possible tasks:&lt;/p>
&lt;ul>
&lt;li>Help cultivate a welcoming and supportive learning community&lt;/li>
&lt;li>Support students in completing hands-on activities related to open source contribution (e.g. evaluating potential projects/communities, using git, setting up a development environment)&lt;/li>
&lt;li>Develop technology-specific tutorials to introduce students to languages/libraries/etc. employed by their project&lt;/li>
&lt;li>Offer mentorship around how to navigate documentation, large codebases, and contributor communities&lt;/li>
&lt;li>Share your own input and perspective on what it&amp;rsquo;s like to be a newcomer to open source!&lt;/li>
&lt;/ul></description></item><item><title>eBPF Monitoring Tools</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lanl/ebpftools/</link><pubDate>Tue, 21 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lanl/ebpftools/</guid><description>&lt;p>&lt;a href="https://ebpf.io" target="_blank" rel="noopener">eBPF&lt;/a> is a technology that allows sandboxed programs to run in a privileged context such as the Linux kernel. eBPF is for operating systems what JavaScript is for web browsers: new functionality can be safely loaded and executed efficiently without restarting or continually upgrading the operating system or browser. eBPF is used to introduce new functionality into a running Linux kernel, including next-generation networking, observability, and security functionality. The following is just one of many possible ideas.&lt;/p>
&lt;h3 id="implement-darshan-functionality-as-ebpf-tool">Implement Darshan functionality as eBPF tool&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> performance, I/O, workload characterization&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:treddy@lanl.gov">Tyler Reddy&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://www.mcs.anl.gov/research/projects/darshan/" target="_blank" rel="noopener">Darshan&lt;/a> is an HPC I/O characterization tool that collect statistics using a lightweight design that makes it suitable for full time deployment. Darshan is an interposer library that catches and counts IO requests (open, write, read, etc.) to a file/file system and it keeps the counters in buckets in data structure that can be queried. How many reads of small size, medium size, large size) for example are the types of things that are counted.&lt;/p>
&lt;p>Being an interposer library means users must link their application with it. Implementing the same functionality in eBPF would make it transparent to users. Darshan already implements all of these functions and could provide the list of functions to implement; the programmer could then build and test them in eBPF on a Linux machine. The result would be a broadly available, generally useful open tool, and just one of perhaps hundreds of eBPF-based tools that the open community could leverage.&lt;/p></description></item><item><title>Reproducible Evaluation of Multipath Network Protocols</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/farmingdale/multipath/</link><pubDate>Thu, 16 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/farmingdale/multipath/</guid><description>&lt;p>Lead Mentor: &lt;a href="mailto:aydini@farmingdale.edu">Ilknur Aydin&lt;/a>&lt;/p>
&lt;p>As mobile devices with dual WiFi and cellular interfaces become widespread, network protocols have been developed that utilize the availability of multiple paths. However, the relative effectiveness of these protocols is highly dependent on the characteristics of the network (including the relationship between the two paths, which are often not independent). Researchers typically evaluate a multipath protocol for a small set of network scenarios, which vary from one publication to the next. It is therefore difficult to get a good picture of how different protocols perform in a range of settings.&lt;/p>
&lt;h3 id="framework-for-repeatable-direct-comparison-of-multipath-transport-protocols">Framework for repeatable, direct comparison of multipath transport protocols&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer networks, wireless systems&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, networking, data analysis and visualization, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="mailto:aydini@farmingdale.edu">Ilknur Aydin&lt;/a> and &lt;a href="mailto:ffund@nyu.edu">Fraida Fund&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In single-path congestion control, the &lt;a href="https://pantheon.stanford.edu/" target="_blank" rel="noopener">Pantheon&lt;/a> work created a reference set of executable benchmarks that researchers could use to evaluate novel congestion control designs against existing work in a wide range of scenarios. This project seeks to achieve something similar for multipath protocols, using publicly available networking testbeds like &lt;a href="https://fabric-testbed.net/" target="_blank" rel="noopener">FABRIC&lt;/a>. For this project, the participant will:&lt;/p>
&lt;ul>
&lt;li>Prepare a set of network benchmarks for multipath protocols, using live network links, real link traces, and emulated scenarios&lt;/li>
&lt;li>Develop an experiment using the benchmarks to evaluate existing multipath protocol implementations&lt;/li>
&lt;li>Prepare materials that researchers can use to evaluate novel multipath protocols against the others in the benchmark&lt;/li>
&lt;/ul></description></item><item><title>Proactive Data Containers (PDC)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/pdc/</link><pubDate>Sun, 12 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/pdc/</guid><description>&lt;p>&lt;a href="https://sdm.lbl.gov/pdc/about.html" target="_blank" rel="noopener">Proactive Data Containers&lt;/a> (PDC) are containers within a locus of storage (memory, NVRAM, disk, etc.) that store science data in an object-oriented manner. Managing data as objects enables powerful optimization opportunities for data movement and transformations, and storage mechanisms that take advantage of the deep storage hierarchy and enable automated performance tuning.&lt;/p>
&lt;h3 id="command-line-and-python-interface-to-an-object-centric-data-management-system">Command line and python interface to an object-centric data management system&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Python&lt;/code>, &lt;code>object-centric data management&lt;/code>, &lt;code>PDC&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, C, Python&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/houjun-tang/">Houjun Tang&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://github.com/hpc-io/pdc" target="_blank" rel="noopener">Proactive Data Containers (PDC)&lt;/a> is an object-centric data management system for scientific data on high performance computing systems. It manages objects and their associated metadata within a locus of storage (memory, NVRAM, disk, etc.). Managing data as objects enables powerful optimization opportunities for data movement and transformations, and storage mechanisms that take advantage of the deep storage hierarchy and enable automated performance tuning. This project includes developing and updating efficient and user friendly command line and Python interfaces for PDC.&lt;/p></description></item><item><title>Is Reproducibility Enough? Understanding the Impact of Missing Settings in Artifact Evaluation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings/</link><pubDate>Wed, 08 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings/</guid><description>&lt;p>While Artifact Evaluation tries to ensure that the evaluation results in a paper are reproducible, it leaves one question: How about experiment settings NOT reported by the paper? Such “missing settings” may create multiple problems: 1) sometimes the artifacts simply do not work under these missing settings, creating problems when a later work needs to compare to an earlier work under these settings; 2) sometimes the artifacts do not perform well under these missing settings, which may create a bias during the evaluation; 3) to improve the artifact to work under these missing settings, sometimes one needs to re-design the system, which may change the results of the original experiments.&lt;/p>
&lt;p>In this project, we plan to understand the impact of this problem: On the necessity side, how would these missing settings affect the conclusions of the original work? On the feasibility side, how much effort does it require to carry out extensive experiments? We plan to answer these questions by reproducing prior works, running them on popular settings that are not reported by these works, and fixing problems if any.&lt;/p>
&lt;h3 id="measuring-research-prototypes-under-unreported-settings">Measuring Research Prototypes under Unreported Settings&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> reproducibility, databases, key-value stores, DNN training&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Java/Python, Linux, TPC/YCSB&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/miao-yu/">Miao Yu&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/xueyuan-ren/">Xueyuan Ren&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The student will first pick one or a few systems she is interested in. Then she will try to reproduce their reported results. If successful, she will further try to measure these systems under previously unreported settings. During the procedure, she will need to diagnose and fix any problems that may show up. Finally, she will analyze whether the original conclusions still hold under these new settings and whether fixing any problems will change the performance characteristics of the target systems.&lt;/p></description></item><item><title>OpenRAM</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/openram/</link><pubDate>Wed, 08 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/openram/</guid><description>&lt;p>&lt;a href="https://github.com/VLSIDA/OpenRAM" target="_blank" rel="noopener">OpenRAM&lt;/a> is an award-winning open-source Python framework to create the layout, netlists, timing and power models, placement and routing models, and other views necessary to use SRAMs in ASIC design. OpenRAM supports integration in both commercial and open-source flows with both predictive and fabricable technologies. Most recently, it has created memories that are included on all of the &lt;a href="https://efabless.com/open_shuttle_program/" target="_blank" rel="noopener">eFabless/Google/Skywater MPW tape-outs&lt;/a>.&lt;/p>
&lt;h3 id="layout-verses-schematic-lvs-visualization">Layout verses Schematic (LVS) visualization&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>VLSI Design Basics&lt;/code>, &lt;code>Python&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, VLSI, JSON&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy/Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jesse-cirimelli-low/">Jesse Cirimelli-Low&lt;/a>, &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mahnoor-ismail/">Mahnoor Ismail&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Create a visualization interface to debug layout versus schematic mismatches in the &lt;a href="https://github.com/RTimothyEdwards/magic" target="_blank" rel="noopener">Magic&lt;/a> layout editor. Results will be parsed from a JSON output of &lt;a href="https://github.com/RTimothyEdwards/netgen" target="_blank" rel="noopener">Netgen&lt;/a>.&lt;/p></description></item><item><title>noWorkflow</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow/</link><pubDate>Tue, 07 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow/</guid><description>&lt;p>The &lt;a href="https://github.com/gems-uff/noworkflow" target="_blank" rel="noopener">noWorkflow&lt;/a> project aims at allowing scientists to benefit from provenance data analysis even when they don&amp;rsquo;t use a workflow system. It also aims to free them from relying on naming conventions to store files originating from previous executions; without such conventions, result and intermediate files are overwritten by every new execution of the pipeline.&lt;/p>
&lt;p>noWorkflow was developed in Python and is currently able to capture the provenance of Python scripts using software engineering techniques such as abstract syntax tree (AST) analysis, reflection, and profiling, collecting provenance without the need for a version control system or any other special environment.&lt;/p>
&lt;p>At the time of this writing, the main version of noWorkflow is in the 2.0-alpha branch. We intend to release it before the summer.&lt;/p>
&lt;h3 id="verify-the-reproducibility-of-an-experiment">Verify the reproducibility of an experiment&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, SQL or SQLAlchemy ORM&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement an algorithm to compare the provenance from two (or more) trials (i.e., executions of an experiment) to check their reproducibility. The provenance stored in the relational (SQLite) database by noWorkflow 2 contains intermediate variable values from a trial. These values can be compared to check how much, or where, executions deviate from each other.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Compare trials of the same script (Medium)&lt;/li>
&lt;li>Estimate how much one trial deviates from another (Medium)&lt;/li>
&lt;li>Consider different scripts and execution flows (Large)&lt;/li>
&lt;li>Indicate which parts of the scripts are not reproducible (Large)&lt;/li>
&lt;/ul>
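A minimal sketch of the comparison idea, assuming trial provenance has already been flattened into dicts mapping variable name to value (an assumption for illustration; the real data lives in noWorkflow's SQLite database):

```python
# Compare intermediate values captured by two trials of the same script.
# The flat-dict input format is a simplification for illustration.

def compare_trials(trial_a, trial_b):
    """Return the deviating variables and the match rate over shared ones."""
    shared = sorted(set(trial_a).intersection(trial_b))
    deviating = [name for name in shared if trial_a[name] != trial_b[name]]
    match_rate = 1.0 if not shared else 1.0 - len(deviating) / len(shared)
    return deviating, match_rate

trial_1 = {"seed": 42, "mean": 0.513, "status": "ok"}
trial_2 = {"seed": 42, "mean": 0.498, "status": "ok"}
deviating, rate = compare_trials(trial_1, trial_2)
# deviating names the variables where the two runs diverge
```

A real implementation would also have to align evaluations across trials (loop iterations, different execution flows) rather than comparing by name alone, which is what makes the "Large" tasks above harder.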
&lt;h3 id="control-levels-of-provenance-collection">Control levels of provenance collection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Log experiments&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jesse-lima/">Jesse Lima&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Add support for different levels of provenance collection in noWorkflow 2. Currently, noWorkflow 2 collects Python construct evaluations and all the dependencies among the evaluations. However, this collection is inefficient, since some of the collected provenance may not be necessary for end-users. This project should provide ways to temporarily disable provenance collection and to manually indicate the provenance in such situations.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Disable the collection inside specific functions (through decorators?)&lt;/li>
&lt;li>Disable the collection inside specific regions of the code (through with statements?)&lt;/li>
&lt;li>Collect only function activations in a region, instead of all variable dependencies&lt;/li>
&lt;li>Disable the collection of specific modules&lt;/li>
&lt;li>Design a DSL to express general dependencies for parts of the code where the collection is disabled&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade-noworkflow-collection-to-support-new-python-constructs">Upgrade noWorkflow collection to support new Python constructs&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Log experiments&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement new AST transformations for provenance collection. While noWorkflow 2 works with newer Python versions, most of its implementation targeted Python 3.7. Newer Python versions have new constructs whose provenance is currently ignored.&lt;/p>
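A toy sketch of the transformation approach: rewrite call expressions so that hypothetical hooks run before and after their evaluation (the hook names __before and __after are placeholders, not noWorkflow internals):

```python
import ast

class WrapCalls(ast.NodeTransformer):
    """Rewrite every call f(x) into __after(__before(), f(x))."""
    def visit_Call(self, node):
        self.generic_visit(node)  # transform nested calls first
        return ast.Call(
            func=ast.Name(id="__after", ctx=ast.Load()),
            args=[ast.Call(func=ast.Name(id="__before", ctx=ast.Load()),
                           args=[], keywords=[]),
                  node],
            keywords=[],
        )

events = []
def __before():
    events.append("before")
def __after(_, value):
    events.append("after")
    return value

tree = ast.fix_missing_locations(WrapCalls().visit(ast.parse("len('abc')")))
exec(compile(tree, "wrapped", "exec"), {"__before": __before, "__after": __after})
# events == ["before", "after"]: both hooks fired around the evaluation
```

The real work is writing one such transformation per construct (match statements, walrus expressions, etc.) and then emitting the right dependency edges from the hooks, rather than just logging events.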
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Identify which AST constructs implementations are missing&lt;/li>
&lt;li>Design AST transformations to execute functions before and after the evaluation of the constructs&lt;/li>
&lt;li>Create the dependencies for the new constructs&lt;/li>
&lt;/ul></description></item><item><title>ScaleBugs: Reproducible Scalability Bugs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucdavis/scalebugs/</link><pubDate>Tue, 07 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucdavis/scalebugs/</guid><description>&lt;p>Scalable systems lay essential foundations of the modern information industry. HPC data centers tend to have hundreds to thousands of nodes in their clusters. The use of “extreme-scale” distributed systems has given birth to a new type of bug: scalability bugs. As the name suggests, scalability bugs manifest depending on the scale of a run, and thus their symptoms may only be observable in large-scale deployments, not in small or medium ones. For example, &lt;a href="https://issues.apache.org/jira/browse/CASSANDRA-6127" target="_blank" rel="noopener">Cassandra-6127&lt;/a> is a scalability bug detected in the popular distributed database Cassandra. The bug causes unnecessary CPU usage; however, the symptom is not observed unless ~1000 nodes are deployed. This demonstrates the main challenge of studying scalability bugs: they are extremely difficult to reproduce without deploying the system at a large scale.&lt;/p>
&lt;p>In this project, our goal is to build a dataset of &lt;strong>reproducible&lt;/strong> scalability bugs. To achieve this, we will go through existing bug reports for popular distributed systems, including Cassandra, HDFS, Ignite, and Kafka. For each bug report, we determine whether the reported bug depends on the scale of the run, such as the number of nodes utilized. For the collected scale-dependent bugs, we will then craft workloads to reproduce them. Our workloads will be designed to trigger certain functionalities of the system under different configurations (e.g., different numbers of nodes), for which we will observe the impact on performance. For example, a successful reproduction should be able to show the performance drop as the number of nodes increases.&lt;/p>
&lt;h3 id="building-a-dataset-of-reproducible-scalability-bugs">Building a Dataset of Reproducible Scalability Bugs&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Scalability systems, bug patterns, reproducibility, bug dataset&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux Shell, Docker, Java, Python&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/cindy-rubio-gonzalez/">Cindy Rubio González&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/hao-nan-zhu/">Hao-Nan Zhu&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/goodness-ayinmode/">Goodness Ayinmode&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahra-nabila-maharani/">Zahra Nabila Maharani&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The student will build a dataset of reproducible scalability bugs. Each bug artifact in the dataset will contain (1) the buggy and fixed versions of the scalability system, (2) a runtime environment that ensures reproducibility, and (3) a workload shell script that could demonstrate the symptoms of the bug under different scales.&lt;/p>
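The workload script's job can be sketched as a scale sweep; here run_workload is a hypothetical stand-in for the real (Docker-based) deployment and measurement, with a fake quadratic cost simulating a bug like Cassandra-6127:

```python
# Hypothetical driver: run the same workload at several scales and flag a
# cost that grows much faster than the node count.

def run_workload(nodes):
    # Placeholder for deploying the system and measuring per-op cost;
    # quadratic growth here simulates a scalability bug.
    return 0.5 * nodes * nodes

def looks_scale_dependent(scales, threshold=4.0):
    """True when cost grows at least threshold times faster than scale."""
    costs = [run_workload(n) for n in scales]
    growth = (costs[-1] / costs[0]) / (scales[-1] / scales[0])
    return growth >= threshold

# A 10x increase in nodes yields a 100x cost increase: flagged as a bug.
flagged = looks_scale_dependent([10, 100])
```

A real artifact would replace the placeholder with the workload shell script and record raw measurements, so that the symptom (and not just a pass/fail flag) is reproducible.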
&lt;h4 id="specific-tasks">Specific Tasks&lt;/h4>
&lt;ul>
&lt;li>Work with the mentors to understand the context of the project.&lt;/li>
&lt;li>Learn the background of scalability systems.&lt;/li>
&lt;li>Inspect the bug reports from Apache JIRA and identify scale-dependent bugs.&lt;/li>
&lt;li>Craft shell scripts to trigger the exact scalability bug described by the bug report.&lt;/li>
&lt;li>Organize the reproducible scalability bugs and write documentation to build the code
and trigger the bug.&lt;/li>
&lt;/ul></description></item><item><title>Strengthening Underserved Segments of the Open Source Pipeline</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/sus/</link><pubDate>Tue, 07 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/sus/</guid><description>&lt;p>Contributing to an open source project offers novices the opportunity to join a community of practitioners, build a technical portfolio, gain experience with industry tools and technologies, and have real-world impact. This project seeks to invite and support broader, more diverse participation in open source by supporting &lt;em>early contributors&lt;/em> – especially those who have been historically minoritized within tech.&lt;/p>
&lt;p>This work builds upon a number of existing projects with similar or overlapping goals. Some examples:&lt;/p>
&lt;ul>
&lt;li>The &lt;a href="http://teachingopensource.org" target="_blank" rel="noopener">Teaching Open Source (TOS) community&lt;/a>, which brings together instructors teaching open source&lt;/li>
&lt;li>The &lt;a href="http://foss2serve.org/index.php/POSSE" target="_blank" rel="noopener">Professors&amp;rsquo; Open Source Software Experience (POSSE) workshops and wiki&lt;/a>, for faculty teaching - or wanting to teach - open source&lt;/li>
&lt;li>Internships such as &lt;a href="https://summerofcode.withgoogle.com" target="_blank" rel="noopener">Google Summer of Code (GSoC)&lt;/a>, &lt;a href="https://www.outreachy.org" target="_blank" rel="noopener">Outreachy&lt;/a>, and the &lt;a href="https://fellowship.mlh.io" target="_blank" rel="noopener">MLH Fellowship&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://campus.openhatch.org" target="_blank" rel="noopener">Open Source Comes to Campus&lt;/a>, offering student workshops on tools and culture &lt;em>[no longer active]&lt;/em>&lt;/li>
&lt;li>&lt;a href="https://codein.withgoogle.com/archive/" target="_blank" rel="noopener">Google Code-in&lt;/a>, inviting pre-university students to make open source contributions &lt;em>[no longer active]&lt;/em>&lt;/li>
&lt;/ul>
&lt;p>This project will investigate gaps in currently available resources/programs and seek to address them, beginning with the exploration of engaging high school students with open source. Depending on early findings, this project could also entail the development of resources for independent learners and/or mentors.&lt;/p>
&lt;h3 id="learning-resource-development--repository-building">Learning Resource Development + Repository-Building&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Education&lt;/code>, &lt;code>Broadening Participation&lt;/code>, &lt;code>Mentorship and Support&lt;/code>, &lt;code>Community Development&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> independent research, communication, organization, GitHub/Markdown, basic web programming (HTML, CSS, JavaScript)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Novice to Intermediate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/emily-lovell/">Emily Lovell&lt;/a>, &lt;a href="mailto:davis@soe.ucsc.edu">James Davis&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/nandini-saagar/">Nandini Saagar&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>As an early contributor to this project, you will help gather information to inform the project direction – and then help bring it to life!&lt;/p>
&lt;p>Possible tasks:&lt;/p>
&lt;ul>
&lt;li>Meet with teachers and/or community members to identify new opportunities to engage with students (e.g. outside-of-school workshops, classroom visits, materials for teachers to use independently)&lt;/li>
&lt;li>Evaluate and test existing learning activities with a high school audience in mind (e.g. consider necessary pre-requisites, time required, ideal activity format)&lt;/li>
&lt;li>Evaluate and organize existing resources for newcomers (e.g. &lt;a href="https://up-for-grabs.net/#/" target="_blank" rel="noopener">Up For Grabs&lt;/a>, &lt;a href="https://hacktoberfest.com" target="_blank" rel="noopener">Hacktoberfest&lt;/a>, internship/fellowship opportunities)&lt;/li>
&lt;li>Help design and pilot new learning activities and/or workshops&lt;/li>
&lt;li>Assist in curating an open source repository of the aforementioned resources&lt;/li>
&lt;li>Conduct outreach to our target communities (e.g. brainstorm a catchy repository name, compose inviting and inclusive emails, design visual project elements)&lt;/li>
&lt;li>Share your own input and perspective on what it&amp;rsquo;s like to be a newcomer to open source!&lt;/li>
&lt;/ul></description></item><item><title>LabOP - an open specification for laboratory protocols, that solves common interchange problems stemming from variations in scale, labware, instruments, and automation.</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/labop/</link><pubDate>Mon, 06 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/labop/</guid><description>
&lt;h3 id="project-idea-1-software-hardware-and-wetware-building-labop-with-simultaneous-language--protocol-development--test-executions">Project idea 1: Software, hardware, and wetware building LabOP with simultaneous language &amp;amp; protocol development &amp;amp; test executions&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Software standard development, Laboratory automation, Biology&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Semantic Web Technologies (RDF, OWL), interest to think about describing biological &amp;amp; chemical laboratory processes&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong>
&lt;ol>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tim-fallon/">Tim Fallon&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dan-bryce/">Dan Bryce&lt;/a>&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h4 id="about-the-laboratory-open-protocol-language-labop">About: The Laboratory Open Protocol Language (LabOP)&lt;/h4>
&lt;p>&lt;strong>See link: &lt;a href="https://bioprotocols.github.io/labop/" target="_blank" rel="noopener">https://bioprotocols.github.io/labop/&lt;/a>&lt;/strong>&lt;/p>
&lt;p>LabOP is an &lt;em>open&lt;/em> specification for laboratory protocols that solves common interchange problems stemming from variations in scale,
labware, instruments, and automation. LabOP was built from the ground up to support protocol interchange. It provides an extensible
library of protocol primitives that capture the control and data flow needed for everything from simple calibration and culturing
protocols to industrial control.&lt;/p>
&lt;h5 id="software-ecosystem">Software Ecosystem&lt;/h5>
&lt;p>LabOP&amp;rsquo;s rich representation underpins an ecosystem of several powerful software tools, including:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://www.github.com/bioprotocols/labop" target="_blank" rel="noopener">labop&lt;/a>: the Python LabOP library, which supports:
&lt;ul>
&lt;li>&lt;em>Programming&lt;/em> LabOP protocols in Python,&lt;/li>
&lt;li>&lt;em>Serialization&lt;/em> of LabOP protocols conforming to the LabOP RDF specification,&lt;/li>
&lt;li>&lt;em>Execution&lt;/em> in the native LabOP semantics (rooted in the UML activity model),&lt;/li>
&lt;li>&lt;em>Specialization&lt;/em> of protocols to 3rd-party protocol formats (including Autoprotocol, OpenTrons, and human-readable formats), and&lt;/li>
&lt;li>&lt;em>Integration&lt;/em> with instruments (including OpenTrons OT2, Echo, and SiLA-based automation).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://www.github.com/bioprotocols/laboped" target="_blank" rel="noopener">laboped&lt;/a>: the web-based LabOP Editor, which supports:
&lt;ul>
&lt;li>&lt;em>Programming&lt;/em> LabOP protocols quickly with low-code visual scripts,&lt;/li>
&lt;li>&lt;em>Storing&lt;/em> protocols on the cloud,&lt;/li>
&lt;li>&lt;em>Exporting&lt;/em> protocol specializations for use in other execution frameworks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="about-the-bioprotocols-working-group">About the Bioprotocols Working Group&lt;/h4>
&lt;p>The Bioprotocols Working Group is an open community organization developing a free and open standard for representation of biological
protocols.&lt;/p>
&lt;p>To join the Bioprotocols Working Group:&lt;/p>
&lt;ul>
&lt;li>Join the community mailing list at: &lt;a href="https://groups.google.com/g/bioprotocols" target="_blank" rel="noopener">https://groups.google.com/g/bioprotocols&lt;/a>&lt;/li>
&lt;li>Join the &lt;code>#collab-bioprotocols&lt;/code> channel on the &lt;a href="https://bitsinbio.org/" target="_blank" rel="noopener">Bits in Bio&lt;/a> Slack.&lt;/li>
&lt;/ul>
&lt;h5 id="leadership">Leadership&lt;/h5>
&lt;p>&lt;em>Elected Term: August 24th, 2022 - August 23rd, 2023&lt;/em>&lt;/p>
&lt;p>&lt;strong>Chair:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dan-bryce/">Dan Bryce&lt;/a> (SIFT)&lt;/p>
&lt;p>&lt;strong>Finance Committee:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="mailto:jeremy.cahill@metamerlabs.io">Jeremy Cahill (Metamer Labs)&lt;/a>&lt;/li>
&lt;li>&lt;a href="mailto:mark.doerr@uni-greifswald.de">Mark Doerr (University of Greifswald)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tim-fallon/">Tim Fallon&lt;/a> (UCSD)&lt;/li>
&lt;/ul>
&lt;h5 id="governance">Governance&lt;/h5>
&lt;p>&lt;em>Approved by community vote on August 16th, 2022&lt;/em>&lt;/p>
&lt;p>&lt;strong>&lt;a href="https://bioprotocols.github.io/labop/about#Governance" target="_blank" rel="noopener">https://bioprotocols.github.io/labop/about#Governance&lt;/a>&lt;/strong>&lt;/p>
&lt;h5 id="mission">Mission:&lt;/h5>
&lt;p>The Bioprotocols Working Group is an open community organization developing free and open standards for representation of biological
protocols. In support of that goal, the organization also develops tools and practices and works with other organizations to
facilitate dissemination and adoption of these standards.&lt;/p>
&lt;p>As an organization, the Bioprotocols Working Group holds the following values:&lt;/p>
&lt;ul>
&lt;li>The standards developed by the community should be available under permissive free and open licenses.&lt;/li>
&lt;li>Technical decisions of the community should be made following open and inclusive processes.&lt;/li>
&lt;li>The community is strengthened by fostering a culture of diversity and inclusion, in which all constructive participants feel
comfortable making their voices heard.&lt;/li>
&lt;/ul></description></item><item><title>GPU Emulator for Easy Reproducibility of DNN Training</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/utexas/gpuemulator/</link><pubDate>Sun, 05 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/utexas/gpuemulator/</guid><description>&lt;p>Deep Neural Networks (DNNs) have achieved success in many machine learning (ML) tasks, including image recognition, video classification, and natural language processing. Nonetheless, training DNN models is highly computation-intensive and usually requires running complex computations on GPUs, which are expensive and scarce resources. As a result, much research on DNN training is delayed by the lack of access to GPUs. However, many research prototypes don&amp;rsquo;t require GPUs themselves but only the performance profiles of GPUs. For example, research on DNN training storage systems doesn’t need to run real computations on GPUs; it only needs to know how much time each GPU computation will take. Meanwhile, GPU performance in DNN training is predictable and reproducible, as every batch of training performs a deterministic sequence of mathematical operations on a fixed amount of data.&lt;/p>
&lt;p>Therefore, in this project we seek to build a GPU emulator platform on PyTorch to easily reproduce DNN training without using real GPUs. We will measure the performance profiles of GPU computations for different models, GPU types, and batch sizes. Based on the measured GPU performance profiles, we will build a platform to emulate GPU behavior and reproduce DNN training using CPUs only. We will make the platform and the measurements open-source, allowing other researchers to reproduce the performance measurements and easily conduct research on DNN training systems. We will also encourage the community to enrich the database by adding GPU performance measurements for their own models and GPU types. We will be the first to build and release this kind of GPU emulator for DNN training, and we believe researchers and the community can benefit greatly from it, especially as more GPU performance profiles are added by the community.&lt;/p>
&lt;h3 id="building-a-platform-to-emulate-gpu-performance-in-dnn-training">Building a platform to emulate GPU performance in DNN training&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> DNN training, reproducibility, GPU emulator, performance measurement&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Linux, Python, PyTorch, deep learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vijay-chidambaram/">Vijay Chidambaram&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yeonju-ro/">Yeonju Ro&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haoran-wu/">Haoran Wu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The student will measure the GPU performance profiles for different models and GPU types, based on which the student will build a platform to emulate the GPU behaviors and easily reproduce DNN training. The GPU performance measurements should be made open-source and reproducible for other researchers to reproduce results and add GPU profiles for their own needs.&lt;/p>
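&lt;p>As a rough illustration of the emulation idea (this is a sketch, not the project&amp;rsquo;s actual code), a profile could map each operation and batch size to a measured duration, and the emulator replays those durations on a CPU instead of running the GPU kernels. All operation names and timings below are hypothetical:&lt;/p>

```python
import time

# Hypothetical profile: measured per-op GPU times in milliseconds,
# keyed by (operation, batch_size). Real profiles would be collected
# on actual GPUs and shipped with the platform.
PROFILE = {
    ("forward", 32): 12.5,
    ("backward", 32): 25.0,
    ("optimizer_step", 32): 3.0,
}

def emulate_step(batch_size, ops=("forward", "backward", "optimizer_step")):
    """Replay one training step using recorded timings instead of a GPU."""
    total_ms = 0.0
    for op in ops:
        ms = PROFILE[(op, batch_size)]
        time.sleep(ms / 1000.0)   # stand-in for the real GPU computation
        total_ms += ms
    return total_ms

def emulate_epoch(num_batches, batch_size):
    """Emulated wall-clock time (ms) of one epoch, CPU-only."""
    return sum(emulate_step(batch_size) for _ in range(num_batches))

print(emulate_epoch(2, 32))   # 81.0
```

&lt;p>A real implementation would hook PyTorch&amp;rsquo;s dispatch so that each traced operation looks up its profile entry automatically, but the lookup-and-replay principle is the same.&lt;/p>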
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Study and get familiar with the PyTorch DNN training pipelines&lt;/li>
&lt;li>Measure GPU performance profiles for different DNN models and GPU types&lt;/li>
&lt;li>Based on the GPU performance measurements, build a platform to emulate the GPU behaviors and reproduce DNN training without using real GPUs&lt;/li>
&lt;li>Organize and document the code to make it reproducible for the community&lt;/li>
&lt;/ul></description></item><item><title>Reproduce and benchmark self-adaptive edge applications under dynamic resource management</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/edgebench/</link><pubDate>Thu, 02 Feb 2023 00:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/edgebench/</guid><description>&lt;p>With the flourishing of ideas like smart cities and smart manufacturing, a massive number of edge devices (e.g., traffic or security cameras, thermometers, flood sensors, etc.) are deployed and connected to the network to collect and analyze data across space and time, helping stakeholders like city governments or manufacturers optimize their plans and operations. Such a large number of edge devices, and the large amount of communication among the devices or to central servers, raises a big challenge: how to manage and schedule resources (i.e., network bandwidth between the devices and/or computing power on both edge devices and bare-metal servers) to ensure the running applications&amp;rsquo; capability of providing a reliable service. Furthermore, given the limited resources available to edge devices, there is a rising trend of reducing average compute and/or bandwidth usage by leveraging the uneven distribution of interesting events across both time and space in the input data. This brings further challenges for provisioning and managing the amount of resources available to the edge devices, as the running applications&amp;rsquo; resource demands can depend greatly on the input data, which is both dynamic and unpredictable.&lt;/p>
&lt;p>With these challenges in mind, the team previously designed and implemented a dynamic resource manager that can understand the applications and make decisions based on that understanding at run time. This understanding rests on a key insight: applications show different magnitudes of performance improvement or degradation in response to changes in the amount of resources available, depending on the input data and how many resources the applications currently have, which we define as the applications&amp;rsquo; sensitivities. However, such a resource manager has only been tested with a limited number and variety of video analytic applications. Hence, through the OSRE23 project, we aim to:&lt;/p>
&lt;ol>
&lt;li>reproduce other state-of-the-art self-adaptive video analytic applications,&lt;/li>
&lt;li>integrate the reproducible applications into the resource manager framework,&lt;/li>
&lt;li>compare the performance with and without resource manager.&lt;/li>
&lt;/ol>
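&lt;p>To make the notion of sensitivity concrete, a minimal sketch (with made-up numbers, not the team&amp;rsquo;s implementation) estimates it as the finite difference of measured performance with respect to the allocated resource:&lt;/p>

```python
def sensitivity(perf_curve, resource_level, delta=1):
    """Estimate an application's sensitivity at a given resource level:
    the change in performance per unit change in allocated resource.

    perf_curve maps a resource amount (e.g. CPU cores, Mbps) to a
    measured performance metric (e.g. frames processed per second).
    """
    lo = perf_curve[resource_level]
    hi = perf_curve[resource_level + delta]
    return (hi - lo) / delta

# Hypothetical measurements for one video-analytics pipeline:
# throughput (fps) observed at each CPU-core allocation.
fps_by_cores = {1: 4.0, 2: 9.0, 3: 11.0, 4: 11.5}

# The app is far more sensitive going from 1 to 2 cores than from
# 3 to 4, so a manager should prefer granting it the second core.
print(sensitivity(fps_by_cores, 1))  # 5.0
print(sensitivity(fps_by_cores, 3))  # 0.5
```

&lt;p>A real manager would measure these curves online and recompute sensitivities as the input data changes.&lt;/p>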
&lt;h3 id="reproducebenchmark-the-self-adaptive-video-analytic-applications-performance-under-dynamic-resource-management">Reproduce/benchmark the self-adaptive video analytic applications&amp;rsquo; performance under dynamic resource management&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Benchmark, Reproducibility, Video analytics, Machine Learning, Resource Management&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, TensorFlow&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:junchenj@uchicago.edu">Junchen Jiang&lt;/a>, &lt;a href="mailto:yuyangh@uchicago.edu">Yuyang Huang&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/faishal-zharfan/">Faishal Zharfan&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Integrate various types of video analytic applications into the aforementioned dynamic resource manager and reproduce/benchmark the applications&amp;rsquo; performance.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Reproduce state-of-the-art video analytic applications&lt;/li>
&lt;li>Integrate such applications into the resource manager framework&lt;/li>
&lt;li>Benchmark the video analytic applications&lt;/li>
&lt;li>Analyze the benchmarked performance results&lt;/li>
&lt;/ul></description></item><item><title>FlashNet: Towards Reproducible Data Science for Storage System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/</guid><description>&lt;p>The Data Storage Research Vision 2025, organized in an NSF workshop, calls for more “AI for storage” research. However, performing ML-for-storage research can be a daunting task for new storage researchers: a person must know both the storage side and the ML side, as if studying two different fields at the same time. This project aims to answer these questions:&lt;/p>
&lt;ol>
&lt;li>How can we encourage data scientists to look into storage problems?&lt;/li>
&lt;li>How can we create a transparent platform that allows such decoupling?&lt;/li>
&lt;li>Within the storage/ML community can we create two collaborative communities, the storage engineers and the storage data scientists?&lt;/li>
&lt;/ol>
&lt;p>In the ML/deep learning community, the large ImageNet benchmark has spurred research in image recognition. Similarly, we would like to provide benchmarks that foster storage research in ML-based per-I/O latency prediction. Therefore, we present FlashNet, a reproducible data science platform for storage systems. As a starting point for this big task, FlashNet has been built for I/O latency prediction. With FlashNet, data engineers can collect the I/O traces of various devices. Data scientists can then train ML models to predict I/O latency based on those traces. All traces, results, and code will be shared on the FlashNet training ground platform, which utilizes Chameleon Trovi for better reproducibility.&lt;/p>
&lt;p>In this project, we plan to improve the modularity of the FlashNet pipeline and develop the Chameleon Trovi packages. We will also continue to improve the performance of our binary-class and multiclass classifiers and test them on the new production traces that we collected from the SNIA IOTA public trace repository. Finally, we will optimize the deployment of our continual-learning mechanism and test it in a cloud system environment. To the best of our knowledge, we are building the world&amp;rsquo;s first end-to-end data science platform for storage systems.&lt;/p>
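&lt;p>As an illustrative sketch of the per-I/O prediction task (not FlashNet&amp;rsquo;s actual pipeline), one could label each I/O as fast or slow against a latency target and fit even a trivial one-feature classifier to a trace. The features, labels, and threshold logic below are entirely synthetic:&lt;/p>

```python
import random

random.seed(0)

# Synthetic stand-in for an I/O trace: each row has features that a
# FlashNet-style pipeline might extract (I/O size in KB, queue depth)
# and a binary label: 1 if the I/O's latency exceeded the target.
def make_trace(n=1000):
    rows = []
    for _ in range(n):
        size_kb = random.choice([4, 16, 64, 256])
        queue_depth = random.randint(1, 32)
        slow = 1 if (queue_depth > 16 or size_kb == 256) else 0
        rows.append((size_kb, queue_depth, slow))
    return rows

def train_stump(rows, feature_idx):
    """Pick the threshold on one feature that best separates slow I/Os."""
    best = (None, 0.0)
    for t in sorted(set(r[feature_idx] for r in rows)):
        correct = sum(1 for r in rows if (r[feature_idx] > t) == bool(r[2]))
        acc = correct / len(rows)
        if acc > best[1]:
            best = (t, acc)
    return best

trace = make_trace()
threshold, acc = train_stump(trace, feature_idx=1)  # queue depth
print(threshold, acc)  # a queue-depth cutoff and its training accuracy
```

&lt;p>Real FlashNet models are of course richer (neural binary/multiclass classifiers over many trace features), but the trace-in, predictor-out shape of the problem is the same.&lt;/p>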
&lt;h3 id="building-flashnet-platform">Building FlashNet Platform&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, reproducibility, machine learning, continual learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python, PyTorch, Experienced with Machine Learning pipeline&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/justin-shin/">Justin Shin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/maharani-ayu-putri-irawan/">Maharani Ayu Putri Irawan&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build an open-source platform to enable collaboration between storage and ML communities, specifically to provide a common platform for advancing data science research for storage systems. The platform will be able to reproduce and evaluate different ML models/architecture, dataset patterns, data preprocessing techniques, and various feature engineering strategies.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Reproduce the FlashNet evaluation results from prior works.&lt;/li>
&lt;li>Build and improve FlashNet components based on the existing blueprint.&lt;/li>
&lt;li>Collect and analyze the FlashNet evaluation results.&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/</guid><description>&lt;p>A high-throughput workflow execution system is needed to continuously gain insights from the increasingly abundant genomics data. However, genomics workflows often have long execution times (e.g., hours to days) due to their large input files. This characteristic presents many complexities when managing systems for genomics workflow execution. Furthermore, based on our observation of a large-scale genomics data processing platform, ~2% of genomics workflows exhibit tail behavior that multiplies their execution time by up to 15x the median, resulting in weeks of execution.&lt;/p>
&lt;p>On the other hand, input files for genomic workflows often vary in quality due to differences in how they are collected. Prior works suggested that these quality differences can affect genomics workflow execution time. Yet, to the best of our knowledge, input quality has never been accounted for in the design of a high-throughput workflow execution system. Even worse, there does not appear to be a consensus on what constitutes ‘input quality,’ at least from a computer systems perspective.&lt;/p>
&lt;p>In this project, we seek to analyze a huge dataset from a large-scale genomics processing platform in order to gain insights on how ‘input quality’ affects genomic workflows’ execution times. Following that, we will build machine learning (ML) models for predicting workflow execution time, in particular for workflows that exhibit tail behavior. We believe these insights and models can become the foundation for designing a novel tail-resilient genomics workflow execution system. Along the way, we will ensure that each step of our analysis is reproducible (e.g., in the form of Jupyter notebooks) and make all our ML models open-source (e.g., in the form of pre-trained models). We sincerely hope our work can ease some of the burdens commonly faced by operators of genomics systems and, at the same time, benefit future researchers working at the intersection of computer systems and genomics.&lt;/p>
&lt;h3 id="analyze-genomics-data-quality--build-exec-time-prediction-models">Analyze genomics data quality &amp;amp; build exec. time prediction models&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> genomics, data analysis, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Linux, Python, Matplotlib, Pandas/Numpy, any ML library&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/">Charis Christopher Hulu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Analyze a large-scale trace of genomics workflow execution along with metrics from various genomics alignment tools (e.g., FastQC, Picard, and GATK metrics) and find features that
correlate the most with workflow execution time and its tail behavior. Then, based on the results, we will build ML models that accurately predict genomic workflows’ execution times.&lt;/p>
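&lt;p>A minimal sketch of the two analysis steps described above, using hypothetical numbers purely for illustration: flag tail executions relative to the median runtime, and correlate a candidate quality feature with runtime:&lt;/p>

```python
import statistics

def tail_workflows(runtimes, factor=3.0):
    """Flag executions whose runtime exceeds factor x the median."""
    med = statistics.median(runtimes)
    return [i for i, r in enumerate(runtimes) if r > factor * med]

def pearson(xs, ys):
    """Correlation between a candidate 'input quality' feature and runtime."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: read duplication rate (a FastQC-style metric)
# per run, and that run's execution time in hours.
dup_rate = [0.05, 0.10, 0.12, 0.20, 0.55]
hours = [2.0, 2.5, 2.4, 3.0, 30.0]

print(tail_workflows(hours))                # [4] -- the week-long outlier
print(round(pearson(dup_rate, hours), 2))   # strong positive correlation
```

&lt;p>The real analysis would sweep many FastQC/Picard/GATK metrics this way before feeding the most predictive ones into an ML model.&lt;/p>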
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Acquire basic understanding of genomics data processing &amp;amp; workflow execution (will be guided by the mentor)&lt;/li>
&lt;li>Reproduce past analysis &amp;amp; models built by prior members of the project&lt;/li>
&lt;li>Propose features from FastQC/Picard/GATK metrics that can be used as a predictor for execution time and tail behavior&lt;/li>
&lt;li>Write a brief analysis as to why those features might work&lt;/li>
&lt;li>Build ML models for predicting execution time&lt;/li>
&lt;li>Package the analysis in the form of Jupyter notebooks&lt;/li>
&lt;li>Package the models in a reloadable format (e.g., pickle)&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Evaluation of Multi-level Erasure Coding</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/multilevelerasure/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/multilevelerasure/</guid><description>&lt;p>Massive storage systems rely heavily on erasure coding (EC) to protect data from drive failures and provide data durability. Existing storage systems mostly adopt single-level erasure coding (SLEC) to protect data, performing EC either at the network level or at the local level. However, both SLEC approaches have limitations: network-only SLEC introduces heavy network traffic overhead, and local-only SLEC cannot tolerate rack failures.&lt;/p>
&lt;p>Accordingly, some data centers are starting to use multi-level erasure coding (MLEC), a hybrid approach that performs EC at both the network level and the local level. However, prior EC research and evaluations mostly focused on SLEC, and it remains an open question how MLEC compares to SLEC in terms of durability, capacity overhead, encoding throughput, network traffic, and other overheads.&lt;/p>
&lt;p>Therefore, in this project we seek to build a platform to evaluate the durability and overheads of MLEC. The platform will allow us to evaluate dozens of EC strategies in many dimensions including recovery strategies, chunk placement choices, various parity schemes, etc. To the best of our knowledge, there is no other evaluation platform like what we propose here. We seek to make the platform open-source and the evaluation reproducible, allowing future researchers to benefit from it and conduct more research on MLEC.&lt;/p>
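&lt;p>One of the dimensions above, capacity overhead, is straightforward to sketch: an (n, k) code stores n chunks for every k chunks of user data, and in MLEC the network-level and local-level overheads multiply. The parameter choices below are hypothetical examples, not recommendations:&lt;/p>

```python
from fractions import Fraction

def capacity_overhead(n, k):
    """Raw-to-usable capacity ratio of a single-level (n, k) code:
    n chunks are stored for every k chunks of user data."""
    return Fraction(n, k)

def mlec_overhead(net, local):
    """MLEC stacks a network-level code on a local (per-node) code,
    so the capacity overheads multiply."""
    return capacity_overhead(*net) * capacity_overhead(*local)

# Hypothetical scheme: (8, 6) across racks, (10, 8) within each node.
net, local = (8, 6), (10, 8)
print(mlec_overhead(net, local))          # 5/3 of raw capacity per user byte
print(float(mlec_overhead(net, local)))   # about 1.67x
```

&lt;p>The planned platform goes much further, of course: durability and repair traffic depend on recovery strategy and chunk placement, which simple closed-form arithmetic like this cannot capture.&lt;/p>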
&lt;h3 id="building-a-platform-to-evaluate-mlec">Building a platform to evaluate MLEC&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, reproducibility, erasure coding, evaluation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Linux, C, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zhiyan-alex-wang/">Zhiyan &amp;quot;Alex&amp;quot; Wang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build a platform to evaluate the durability and overheads of MLEC. The platform will be able to evaluate different EC strategies in various dimensions including repair strategies, chunk placement choices, parity schemes, etc. Analyze the evaluation results.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Reproduce the SLEC evaluation results from prior SLEC evaluation tools&lt;/li>
&lt;li>Based on prior SLEC evaluation tools, build a platform to evaluate the durability and overheads of MLEC under various EC strategies&lt;/li>
&lt;li>Collect and analyze the MLEC evaluation results&lt;/li>
&lt;/ul></description></item><item><title>Automatic Cluster Performance Shifts Detection Toolkit</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/anl/perfdrift/</link><pubDate>Wed, 01 Feb 2023 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/anl/perfdrift/</guid><description>&lt;p>High-performance computing (HPC) clusters typically suffer from performance degradation over time. The heterogeneous nature of clusters and the inevitable defects in various infrastructure layers make performance inside them harder to predict. On the other hand, when events such as software upgrades happen, we might also observe performance improvement or degradation even though nothing changes in the hardware. Due to these uncertainties, it is necessary to send administrators early notification of changes in cluster performance within a specific time window, to inform scheduling decisions and increase cluster utilization.&lt;/p>
&lt;p>We are targeting HPC clusters that cater to heterogeneous compute- and I/O-intensive workloads, ranging from scientific simulation to AI model training, with a high degree of parallelism. In this scenario, we plan to use the open-source Darshan toolkit (&lt;a href="https://github.com/darshan-hpc/darshan" target="_blank" rel="noopener">https://github.com/darshan-hpc/darshan&lt;/a>) as the data collection and profiling tool around which we design our performance drift algorithms. Furthermore, we may incorporate the distribution shift detection into Darshan itself, making it able to notify HPC system administrators directly.&lt;/p>
&lt;p>Our goal is to show the efficacy of our algorithm by plotting the processed profiling data and highlighting the specific time windows in which performance shifts happened. Finally, we will package all our profiling data and experiment scripts in Jupyter notebooks, in particular on Chameleon Trovi, to help others reproduce our experiments.&lt;/p>
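&lt;p>As a rough sketch of the detection idea (a simple mean-shift test over sliding windows, not the algorithm we will ultimately design), each new window of profiled throughput can be compared against a baseline window; all numbers below are synthetic:&lt;/p>

```python
import statistics

def detect_shift(samples, window=20, z_threshold=4.0):
    """Flag window start indices where the window mean deviates from
    the baseline window's mean by more than z_threshold standard errors."""
    alerts = []
    base = samples[:window]
    mu = statistics.fmean(base)
    se = statistics.stdev(base) / (window ** 0.5)
    for start in range(window, len(samples) - window + 1, window):
        win = samples[start:start + window]
        z = abs(statistics.fmean(win) - mu) / se
        if z > z_threshold:
            alerts.append(start)
    return alerts

# Synthetic profile stream: throughput holds near 100 MB/s, then a
# software upgrade at sample 60 degrades it to around 80 MB/s.
stream = [100.0 + (i % 5) for i in range(60)] + [80.0 + (i % 5) for i in range(60)]
print(detect_shift(stream))   # [60, 80, 100] -- every window after the shift
```

&lt;p>A production detector would need an adaptive baseline and robustness to heavy-tailed, heterogeneous workloads, which is exactly where the statistical/ML design work in this project comes in.&lt;/p>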
&lt;p>Through this research, we seek to contribute the following:&lt;/p>
&lt;ul>
&lt;li>Designing an algorithm to detect performance shifts in HPC clusters that can be adapted for heterogeneous workloads&lt;/li>
&lt;li>Detecting performance shifts in real time without introducing significant overhead into the system&lt;/li>
&lt;li>Contributing to Darshan so that it can automatically detect performance changes while profiling the clusters&lt;/li>
&lt;/ul>
&lt;h3 id="automatic-and-adaptive-performance-shifts-detection">Automatic and Adaptive Performance Shifts Detection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Statistical Machine Learning, Deep Learning, and High-Performance Computing (HPC)&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python, Statistics, good to have: Machine Learning, Deep learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> Sandeep Madireddy (&lt;a href="https://www.anl.gov/profile/sandeep-r-madireddy" target="_blank" rel="noopener">https://www.anl.gov/profile/sandeep-r-madireddy&lt;/a>, &lt;a href="http://www.mcs.anl.gov/~smadireddy/" target="_blank" rel="noopener">http://www.mcs.anl.gov/~smadireddy/&lt;/a> ), Ray Andrew Sinurat (&lt;a href="https://rayandrew.me" target="_blank" rel="noopener">https://rayandrew.me&lt;/a>)&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kangrui-wang/">Kangrui Wang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>All in all, these are the specific tasks that the student should do:&lt;/p>
&lt;ul>
&lt;li>Collaborate and work with mentors to understand the goal of this project.&lt;/li>
&lt;li>Implement distribution shift detection in pure statistical or machine/deep learning&lt;/li>
&lt;li>Deploy the algorithm and try to see its efficacy in the clusters.&lt;/li>
&lt;li>Package this experiment to make it easier for others to reproduce&lt;/li>
&lt;/ul></description></item><item><title>OpenROAD - An Open-Source, Autonomous RTL-GDSII Flow for VLSI Designs (2023)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/openroad/</link><pubDate>Wed, 01 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/openroad/</guid><description>&lt;p>The &lt;a href="https://theopenroadproject.org" target="_blank" rel="noopener">OpenROAD&lt;/a> project is a non-profit, DARPA-funded, Google-sponsored project committed to creating low-cost and innovative Electronic Design Automation (EDA) tools and flows for IC design. Our mission is to democratize IC design, break down barriers of cost and access, and mitigate schedule risk through native, open-source innovation and collaboration with ecosystem partners. &lt;a href="https://github.com/The-OpenROAD-Project" target="_blank" rel="noopener">OpenROAD&lt;/a> provides an autonomous, no-human-in-the-loop, 24-hour RTL-GDSII flow for fast ASIC design exploration, QoR estimation, and physical implementation for a range of technologies above 12 nm. We welcome a diverse community of designers, researchers, enthusiasts, software engineers, and entrepreneurs to use and contribute to OpenROAD and make a far-reaching impact. OpenROAD has been used in &amp;gt;600 tapeouts across a range of ASIC applications, with a rapidly growing and diverse user community.&lt;/p>
&lt;h3 id="enhance-openroad-gui-flow-manager">Enhance OpenROAD GUI Flow Manager&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>GUI&lt;/code>, &lt;code>Visualization&lt;/code>, &lt;code>User Interfaces&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, Qt&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>, &lt;a href="mailto:ethanmoon@google.com">Ethan Mahintorabi&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop custom features for analysis and visualizations in the &lt;a href="https://openroad.readthedocs.io/en/latest/main/src/gui/README.html" target="_blank" rel="noopener">OpenROAD GUI&lt;/a> to support native and third-party flows, including &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a>, &lt;a href="https://github.com/The-OpenROAD-Project/OpenLane" target="_blank" rel="noopener">OpenLane&lt;/a>, and other third-party flows. Create documentation (commands, developer guide notes, and tutorials) to show GUI usage for supported flows.&lt;/p>
&lt;h3 id="profile-and-tune-openroad-flow-for-runtime-improvements">Profile and tune OpenROAD flow for Runtime improvements&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>OpenROAD-flow-scripts&lt;/code>, &lt;code>Flow Manager&lt;/code>, &lt;code>Runtime Optimization&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge about Computational resource optimization, Cloud-based computation, Basic VLSI design and tools knowledge&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>, &lt;a href="mailto:ethanmoon@google.com">Ethan Mahintorabi&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Test, analyze, and develop verifiable and reproducible strategies to improve runtimes in &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a>. These include optimization of computational resources over the cloud and tuning of algorithmic and design-flow parameters. Create test plans using existing or new designs to show runtime improvements.&lt;/p>
&lt;h3 id="update-openroad-documentation-and-tutorials">Update OpenROAD Documentation and Tutorials&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Documentation&lt;/code>, &lt;code>Tutorials&lt;/code>, &lt;code>VLSI design basics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge of EDA tools, basics of VLSI design flow, tcl, shell scripts, Documentation, Markdown&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Review and update missing documentation and tutorials in &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a> for existing and new features. Here is an example Tutorial link: &lt;a href="https://openroad-flow-scripts.readthedocs.io/en/latest/tutorials/FlowTutorial.html" target="_blank" rel="noopener">https://openroad-flow-scripts.readthedocs.io/en/latest/tutorials/FlowTutorial.html&lt;/a> for reference.&lt;/p>
&lt;h3 id="lef-and-liberty-model-testing">LEF and Liberty Model Testing&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Testing&lt;/code>, &lt;code>LEF&lt;/code>, &lt;code>LIB&lt;/code>, &lt;code>VLSI design basics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge of EDA tools, basics of VLSI design, LEF and LIB model abstracts, tcl, shell scripts, Verilog, Layout&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Test the accuracy of generated LIB and LEF models for signoff in &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a> for flat and hierarchical design flows. Build test cases to validate and add to the regression suite.&lt;/p></description></item><item><title>Teaching Computer Networks with Reproducible Research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet/</link><pubDate>Wed, 18 Jan 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet/</guid><description>&lt;p>Lead Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>&lt;/p>
&lt;p>In the field of computer networks and wireless communication systems, the availability of open access networking and cloud computing testbeds (&lt;a href="https://portal.geni.net/" target="_blank" rel="noopener">GENI&lt;/a>, &lt;a href="https://cloudlab.us/" target="_blank" rel="noopener">CloudLab&lt;/a>, &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a>, &lt;a href="https://fabric-testbed.net/" target="_blank" rel="noopener">FABRIC&lt;/a>, and others) has been transformative in promoting reproducible research &lt;em>and&lt;/em> in making high-quality experiential learning available to students and educators at a wide range of colleges and universities. This project seeks to unite research and education use of these testbeds by developing new ways of using reproducible research to teach computer networks and related topics.&lt;/p>
&lt;h3 id="bringing-foundational-results-into-the-classroom">Bringing foundational results into the classroom&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer networks, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;/ul>
&lt;p>To make foundational results from computer networks more concrete, this project seeks to reproduce a selection of key results and package them for use as interactive classroom demonstrations. (An example of a &amp;ldquo;foundational&amp;rdquo; result might be the result from the 1980s that motivates congestion control by showing how &lt;a href="http://dx.doi.org/10.1016/0169-7552%2889%2990019-6" target="_blank" rel="noopener">congestion collapse occurs when the network is under heavy load&lt;/a>.) This involves:&lt;/p>
&lt;ul>
&lt;li>Reproducing the original results on an open-access testbed&lt;/li>
&lt;li>Packaging the materials for use as a classroom demo, with interactive elements&lt;/li>
&lt;li>Creating assessment questions and sample &amp;ldquo;solutions&amp;rdquo; related to the materials, that instructors may use in homework assignments or exams&lt;/li>
&lt;/ul>
&lt;h3 id="developing-a-classroom-competition-for-adaptive-video-delivery-policies">Developing a &amp;ldquo;classroom competition&amp;rdquo; for adaptive video delivery policies&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer networks, adaptive video, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, Python, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/srishti-jaiswal/">Srishti Jaiswal&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>A carefully designed competition can be a fun and exciting way for students to challenge themselves and gain &amp;ldquo;ownership&amp;rdquo; of a new topic. This project builds on an existing open source &lt;a href="https://witestlab.poly.edu/blog/adaptive-video-reproducing/" target="_blank" rel="noopener">reproducible result&lt;/a> for adaptive video delivery, and will challenge students to extend this work and design their own adaptive video policies for head-to-head competition against their classmates. This includes:&lt;/p>
&lt;ul>
&lt;li>Packaging the result to make it easier for students to reproduce and then build on the original work&lt;/li>
&lt;li>Implementing other adaptive video policies from the literature, so that students can use them as a baseline&lt;/li>
&lt;li>Developing different network settings (using live link traces and emulated link patterns) in which student submissions may be evaluated&lt;/li>
&lt;li>Developing an evaluation framework for scoring student submissions on different criteria and in different network settings, and making the results available in a leaderboard format&lt;/li>
&lt;/ul></description></item><item><title>Using Reproducibility in Machine Learning Education</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml/</link><pubDate>Wed, 18 Jan 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml/</guid><description>&lt;p>Lead Mentor: &lt;a href="mailto:ffund@nyu.edu">Fraida Fund&lt;/a>&lt;/p>
&lt;p>The computer science and engineering classroom is an essential part of the reproducibility &amp;ldquo;ecosystem&amp;rdquo; - because of its broad reach and potential for big impact, and because, for many students, the classroom is their first exposure to research in their field. For machine learning in particular, reproducibility is an important element of the research culture, and can be a valuable part of any introductory or advanced course in the field. These projects will develop highly interactive open educational resources that may be adopted by instructors of graduate or undergraduate machine learning courses to incorporate more instruction about reproducibility and reproducible research.&lt;/p>
&lt;h3 id="introducing-levels-of-reproduction-and-replication-in-ml">Introducing &amp;ldquo;levels&amp;rdquo; of reproduction and replication in ML&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Machine learning, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, machine learning, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In machine learning, replicating a published result to confirm the validity of the experimental results and the broader conclusions of the paper can take several forms, with increasing levels of effort:&lt;/p>
&lt;ul>
&lt;li>running the model on the same benchmarks as the original paper, using the authors&amp;rsquo; code and pre-trained weights,&lt;/li>
&lt;li>training a model using authors&amp;rsquo; code and published hyperparameters,&lt;/li>
&lt;li>training a model using authors&amp;rsquo; code and a new hyperparameter search,&lt;/li>
&lt;li>validating the authors&amp;rsquo; code e.g. with unit tests, in addition to training,&lt;/li>
&lt;li>re-implementing the model,&lt;/li>
&lt;li>designing additional experiments to validate that the suggested mechanism is in fact responsible for the result,&lt;/li>
&lt;li>and more.&lt;/li>
&lt;/ul>
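&lt;p>Even at the first level, deciding whether a run &amp;ldquo;reproduces&amp;rdquo; a result requires an agreement criterion. The following hypothetical Python helper (not part of any project deliverable) sketches one way to compare reproduced benchmark metrics against reported ones within a relative tolerance; the threshold is illustrative, and an appropriate value depends on the benchmark&amp;rsquo;s variance:&lt;/p>

```python
def within_tolerance(reported, reproduced, rel_tol=0.02):
    """Check whether a reproduced metric agrees with the reported one
    within a relative tolerance (2% here is an illustrative default,
    not a community standard)."""
    if reported == 0:
        return abs(reproduced) <= rel_tol
    return abs(reproduced - reported) / abs(reported) <= rel_tol

def compare_results(reported, reproduced, rel_tol=0.02):
    """Compare dicts of {benchmark: metric} and report which benchmarks
    replicate within tolerance; missing benchmarks count as failures."""
    return {
        bench: within_tolerance(reported[bench],
                                reproduced.get(bench, float("nan")),
                                rel_tol)
        for bench in reported
    }
```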
&lt;p>This project will develop interactive materials (using one or more exemplar published results) to illustrate and to highlight relevant aspects and pitfalls of each of these &amp;ldquo;levels&amp;rdquo; of reproduction and replication.&lt;/p>
&lt;h3 id="packaging-existing-reproducible-results-for-the-ml-classroom">Packaging existing reproducible results for the ML classroom&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Machine learning, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, machine learning, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/shekhar/">Shekhar&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jonathan-edwin/">Jonathan Edwin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal is to make it easier for instructors to expose students to state-of-the-art research in the classroom. This project will work with an existing set of recent reproducible results in machine learning, and will package them for easier consumption by students and more effective use in the classroom. This may include, but is not necessarily limited to:&lt;/p>
&lt;ul>
&lt;li>Re-validating the result and re-packaging along with computational environment on an open access testbed&lt;/li>
&lt;li>Creating tutorial material around the result, including interactive visualizations to demonstrate key elements of the work&lt;/li>
&lt;li>Creating one-click demos for applying the model/technique to a new test sample&lt;/li>
&lt;li>Curating test samples to highlight important advantages and limitations of the result&lt;/li>
&lt;li>Creating assessment questions and sample &amp;ldquo;solutions&amp;rdquo; that instructors may use to &amp;ldquo;assign&amp;rdquo; the work to students&lt;/li>
&lt;/ul></description></item><item><title>Public Artifact Data and Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/intel/artifactviz/</link><pubDate>Mon, 09 Jan 2023 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/intel/artifactviz/</guid><description>&lt;p>Reproducibility and Artifact Evaluation efforts have focused on reproducing the results, but not necessarily on storing, visualizing and making the results accessible. This set of projects builds the initial building blocks to log, capture, and visualize experiments.&lt;/p>
&lt;h3 id="experiment-log">Experiment Log&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Provide tools to log experiments&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Simple&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjo-vahldiek-oberwagner/">Anjo Vahldiek-Oberwagner&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop client- and server-side tools to start/stop an experiment and timestamp it. Document each iteration of the experiment, and create a database to visualize the log of experiments.&lt;/p>
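&lt;p>As a rough sketch of what such a tool could look like, the following Python snippet logs start/stop timestamps for experiments into a SQLite database. The class name and schema are hypothetical, chosen only for illustration; a real implementation would split this into a client and a server and record each iteration separately:&lt;/p>

```python
import sqlite3
import time

class ExperimentLog:
    """Minimal experiment log: start/stop experiments with timestamps,
    backed by SQLite (hypothetical schema, for illustration only)."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS experiments ("
            "id INTEGER PRIMARY KEY, name TEXT, "
            "started REAL, stopped REAL, notes TEXT)"
        )

    def start(self, name, notes=""):
        # Record the start timestamp and return an id for later stop().
        cur = self.db.execute(
            "INSERT INTO experiments (name, started, notes) VALUES (?, ?, ?)",
            (name, time.time(), notes),
        )
        self.db.commit()
        return cur.lastrowid

    def stop(self, exp_id):
        # Close out the experiment with a stop timestamp.
        self.db.execute(
            "UPDATE experiments SET stopped = ? WHERE id = ?",
            (time.time(), exp_id),
        )
        self.db.commit()

    def log(self):
        # The "visualization" here is just a row dump.
        return self.db.execute(
            "SELECT id, name, started, stopped FROM experiments"
        ).fetchall()
```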
&lt;h3 id="capture-hwsw-state--continuous-monitoring">Capture HW/SW state &amp;amp; continuous monitoring&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Record initial state&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjo-vahldiek-oberwagner/">Anjo Vahldiek-Oberwagner&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Provide simple tools to gather the initial state of each experimental machine and its connected devices, configurations, software versions, &amp;hellip; Upload the data into the experiment log database and visualize it. Ideally, provide a diff function between experimental runs.&lt;/p>
&lt;p>In a second step, monitor the machine’s state during execution. This includes network, memory, CPU, and general OS statistics.&lt;/p>
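&lt;p>A minimal sketch of state capture and the suggested diff function, using only the Python standard library. The recorded fields are a small illustrative subset; a real tool would also gather devices, package versions, kernel parameters, and so on:&lt;/p>

```python
import platform
import sys

def capture_state():
    """Capture a (deliberately small) snapshot of the machine's HW/SW state.
    Fields here are illustrative only."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }

def diff_state(old, new):
    """Return {key: (old_value, new_value)} for every entry that changed
    between two experimental runs."""
    keys = set(old) | set(new)
    return {k: (old.get(k), new.get(k))
            for k in keys if old.get(k) != new.get(k)}
```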
&lt;h3 id="record-and-visualize-experimental-results">Record and visualize experimental results&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Record results in various formats and visualize them&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjo-vahldiek-oberwagner/">Anjo Vahldiek-Oberwagner&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jiayuan-zhu/">Jiayuan Zhu&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/krishna-madhwani/">Krishna Madhwani&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Experiments generate results in various formats (e.g., CSV, JSON, text files, …). The goal of this project is to provide tools to extract common formats, connect the results to the experiment log, and visualize them, ideally allowing comparison of different experimental runs. Initially, the project could dump its results into a Prometheus instance (&lt;a href="https://prometheus.io/" target="_blank" rel="noopener">https://prometheus.io/&lt;/a>), which would then become available for everyone to explore.&lt;/p></description></item><item><title>Polyphorm / PolyPhy</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy/</link><pubDate>Thu, 15 Dec 2022 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy/</guid><description>&lt;p>&lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">PolyPhy&lt;/a> is a GPU-oriented agent-based system for reconstructing and visualizing &lt;em>optimal transport networks&lt;/em> defined over sparse data. Rooted in astronomy and inspired by nature, we have used an early prototype called &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">Polyphorm&lt;/a> to reconstruct the &lt;a href="https://youtu.be/5ILwq5OFuwY" target="_blank" rel="noopener">Cosmic web&lt;/a> structure, but also to discover network-like patterns in natural language data. You can see an instructive overview of PolyPhy in our &lt;a href="https://elek.pub/workshop_cross2022.html" target="_blank" rel="noopener">workshop&lt;/a> and more details about our research &lt;a href="https://elek.pub/projects/Rhizome-Cosmology" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Under the hood, PolyPhy uses a richer 3D scalar field representation of the reconstructed network, instead of a typical discrete representation like a graph or a mesh. The ultimate purpose of PolyPhy is to become a toolkit for a range of specialists across different disciplines: astronomers, neuroscientists, data scientists and even artists and designers. PolyPhy aspires to be a tool for discovering connections between different disciplines by creating quantitatively comparable structural analytics.&lt;/p>
&lt;h3 id="polyphy-infrastructure-engineering-and-practices">PolyPhy infrastructure engineering and practices&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>DevOps&lt;/code> &lt;code>Code Refactoring&lt;/code> &lt;code>CI/CD&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> fluidity in Python, experience with OOP, experience with building and packaging libraries, understanding GitHub and its tools ecosystem&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350+ hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:anishagoel14@gmail.com">Anisha Goel&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/prashant-jha/">Prashant Jha&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Your responsibility in this project will be developing new infrastructure for the PolyPhy project, as well as maintaining the existing &lt;a href="https://github.com/PolyPhyHub/" target="_blank" rel="noopener">codebases&lt;/a>. This is a multifaceted role that will require coordination with the team and an active approach to understanding the technical needs of the community.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Work with the technical lead to develop effective interfaces for PolyPhy, providing access to its functionality on the level of both Python/Jupyter code and the command line.&lt;/li>
&lt;li>Maintain the existing &lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">codebase&lt;/a> and configure it according to the team&amp;rsquo;s needs.&lt;/li>
&lt;li>Develop and extend the current CI/CD functionality and related code metrics.&lt;/li>
&lt;li>Document the best practices related to the above.&lt;/li>
&lt;/ul>
&lt;h3 id="write-polyphys-technical-story-and-content">Write PolyPhy&amp;rsquo;s technical story and content&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Writing&lt;/code> &lt;code>Documentation&lt;/code> &lt;code>Storytelling&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> experience writing structured text, well-read, technical or scientific education, webdev basics (preferably NodeJS)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:ez@nmsu.edu">Ezra Huscher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Integral to PolyPhy&amp;rsquo;s presentation is a &amp;ldquo;story&amp;rdquo; - a narrative understanding - that the users and the project contributors can relate to. Your responsibility will be to develop the written part of that understanding, as well as major portions of technical documentation that match it.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Write and edit diverse pages of the project &lt;a href="https://www.polyphy.io" target="_blank" rel="noopener">website&lt;/a>.&lt;/li>
&lt;li>Work with mentors to improve project&amp;rsquo;s written community practices (diversity, communication).&lt;/li>
&lt;li>Write and edit narrative and explanatory parts of PolyPhy&amp;rsquo;s documentation.&lt;/li>
&lt;li>Create tutorials that present core functionality of the toolkit.&lt;/li>
&lt;/ul>
&lt;h3 id="community-engagement-and-management">Community engagement and management&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Community Management&lt;/code> &lt;code>Social Media&lt;/code> &lt;code>Networking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> documented experience with the current social media landscape, sociable and well-spoken, ability to communicate technical concepts&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 175 or 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:ez@nmsu.edu">Ezra Huscher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Your responsibility will be to build and engage the community around PolyPhy. This includes its standing team and stakeholders, current expert users, potential adopters as well as the general public. The scope (size) of the project depends on the level of commitment during and beyond the Summer and is negotiable upfront.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Manage the team&amp;rsquo;s communication channels (Slack, Zoom, email) and maintain active presence therein.&lt;/li>
&lt;li>Develop social media presence for PolyPhy on Twitter, LinkedIn and other selected social media platforms.&lt;/li>
&lt;li>Manage and extend the online presence for the project, including its &lt;a href="https://polyphy.io" target="_blank" rel="noopener">website&lt;/a>, mailing list, and other applicable outreach activities.&lt;/li>
&lt;li>Research and engage with new communities that would benefit from PolyPhy, both as its expert users and contributors.&lt;/li>
&lt;/ul></description></item><item><title>FasTensor</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/fastensor/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/fastensor/</guid><description>&lt;p>&lt;a href="https://sdm.lbl.gov/fastensor/" target="_blank" rel="noopener">FasTensor&lt;/a> is a parallel execution engine for user-defined functions on multidimensional arrays. The user-defined functions follow the stencil metaphor used in scientific computing and are effective for expressing a wide range of computations for data analyses, including common aggregation operations from database management systems and advanced machine learning pipelines. The FasTensor execution engine exploits the structural locality in the multidimensional arrays to automate data management operations such as file I/O, data partitioning, communication, and parallel execution.&lt;/p>
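&lt;p>To illustrate the stencil metaphor, here is a toy 1-D sketch in Python. It is not FasTensor&amp;rsquo;s actual C++ API, only a minimal model of applying a user-defined function over each neighborhood of an array:&lt;/p>

```python
def run_stencil(array, udf, radius=1, pad=0):
    """Apply a user-defined function to each stencil neighborhood of a
    1-D array (zero-padded at the edges). FasTensor generalizes this
    idea to parallel execution over multidimensional, partitioned
    arrays, handling I/O and communication automatically."""
    padded = [pad] * radius + list(array) + [pad] * radius
    return [udf(padded[i - radius:i + radius + 1])
            for i in range(radius, radius + len(array))]

# A 3-point moving average expressed as a stencil UDF:
smooth = run_stencil([1.0, 2.0, 3.0, 4.0], lambda w: sum(w) / len(w))
```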
&lt;h3 id="tensor-execution-engine-on-gpu">Tensor execution engine on GPU&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Data Management&lt;/code>, &lt;code>Analytics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, github&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Tensor-based computing is needed by scientific applications and, increasingly, by advanced AI model training. Most tensor libraries are hand-customized and optimized for GPU, and most of them serve only one kind of application. For example, TensorFlow is optimized only for AI model training. Optimizing a generic tensor computing library on GPU can benefit a wide range of applications. FasTensor, a generic tensor computing library, currently works efficiently only on CPU. Running FasTensor on GPU is still unexplored. Research and development challenges include, but are not limited to: 1) how to maintain the structural locality of tensor data on GPU; 2) how to reduce the performance loss when the structural locality of a tensor is broken on GPU.&lt;/p>
&lt;ul>
&lt;li>Develop a mechanism to move user-defined computing kernels onto the GPU&lt;/li>
&lt;li>Evaluate the performance of the execution engine&lt;/li>
&lt;li>Document the execution mechanism&lt;/li>
&lt;li>Develop performance testing suite&lt;/li>
&lt;/ul>
&lt;h3 id="continuous-integration">Continuous Integration&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Data Management&lt;/code>, &lt;code>Analytics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, github&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (300 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a>, &lt;a href="mailto:dbin@lbl.gov">Bin Dong&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>Develop a test suite for the public API of FasTensor&lt;/li>
&lt;li>Automate execution of the test suite&lt;/li>
&lt;li>Document the continuous integration process&lt;/li>
&lt;/ul></description></item><item><title>LiveHD (2023)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd/</guid><description>&lt;p>Projects for &lt;a href="https://github.com/masc-ucsc/livehd" target="_blank" rel="noopener">LiveHD&lt;/a>.&lt;br>
Lead Mentors: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>.&lt;br>
Contributor(s): &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/shahzaib-kashif/">Shahzaib Kashif&lt;/a>&lt;/p>
&lt;p>LiveHD is a &amp;ldquo;compiler&amp;rdquo; infrastructure for hardware design optimized for synthesis and simulation. The goal is to enable a more productive flow in which the ASIC/FPGA designer can work with multiple hardware description languages, such as CHISEL, Pyrope, or Verilog.&lt;/p>
&lt;p>There are several projects available around LiveHD. A longer explanation and more project options are available at
&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/projects.md" target="_blank" rel="noopener">projects&lt;/a>. Contact the
mentors to find a project that fits your interests.&lt;/p>
&lt;p>A sample of helpful projects:&lt;/p>
&lt;h3 id="mockturtle">Mockturtle&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>Mockturtle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Perform synthesis for graph in LiveHD using Mockturtle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17, synthesis&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Medium 175 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/projects_large.md#medium-parallel-and-hierarchical-synthesis-with-mockturtle" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Mockturtle (&lt;a href="https://github.com/lsils/mockturtle" target="_blank" rel="noopener">https://github.com/lsils/mockturtle&lt;/a>) is a synthesis tool partially
integrated with LiveHD. The goal of this task is to iron out bugs and issues
and to use the LiveHD Tasks API to parallelize the synthesis.&lt;/p>
&lt;p>Main features:&lt;/p>
&lt;ul>
&lt;li>The current synthesis divides the circuit into partitions. Each partition can be synthesized in parallel.&lt;/li>
&lt;li>Support hierarchical synthesis to optimize across Lgraphs (cross-Verilog-module optimization)&lt;/li>
&lt;/ul>
&lt;p>The goal is to use Mockturtle (&lt;a href="https://github.com/lsils/mockturtle" target="_blank" rel="noopener">https://github.com/lsils/mockturtle&lt;/a>) with LiveHD. The main characteristics:&lt;/p>
&lt;ul>
&lt;li>Use mockturtle to tmap to LUTs&lt;/li>
&lt;li>Use mockturtle to synthesize (optimize) logic&lt;/li>
&lt;li>Enable cut-rewrite as an option&lt;/li>
&lt;li>Enable hierarchy cross optimization (hier:true option)&lt;/li>
&lt;li>Use the graph labeling to find clusters to optimize&lt;/li>
&lt;li>Re-timing&lt;/li>
&lt;li>Map to LUTs only gates and non-wide arithmetic. E.g., a 32-bit add is not mapped to LUTs, but a 2-bit add is.&lt;/li>
&lt;li>List of resources to not map:
&lt;ul>
&lt;li>Large ALUs. Large ALUs should have an OpenWare block (hardcoded in FPGAs and advanced adder options in ASIC)&lt;/li>
&lt;li>Multipliers and dividers&lt;/li>
&lt;li>Barrel shifters with non-trivial shifts (more than 1-2 bits) selectable at run-time&lt;/li>
&lt;li>Memories and LUTs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
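&lt;p>A back-of-the-envelope way to see why wide arithmetic is excluded: an n-bit adder viewed as a single lookup table has 2^(2n) entries. The following Python sketch (purely illustrative, not Mockturtle&amp;rsquo;s API) enumerates the truth table of a small adder:&lt;/p>

```python
from itertools import product

def adder_lut(n_bits):
    """Enumerate the truth table ("LUT contents") of an n-bit adder.
    The table has 2**(2*n_bits) entries, which is why only narrow
    arithmetic is worth mapping to LUTs; the sum includes the carry,
    so results occupy n_bits + 1 bits."""
    width = 2 ** n_bits
    return {(a, b): a + b for a, b in product(range(width), repeat=2)}

lut2 = adder_lut(2)   # 16 entries: feasible as a LUT
# adder_lut(32) would need 2**64 entries: clearly not mappable
```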
&lt;h3 id="livehd-console">LiveHD Console&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>LiveHD Console&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Create a console app that interacts with LiveHD to query parameters about designs&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Medium 175 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/projects_small.md#medium-query-shell-not-lgshell-to-query-graphs" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>LiveHD currently uses replxx, but it is a no-longer-maintained shell/console library. As a result, it fails on newer versions of macOS.&lt;/p>
&lt;p>There is an alternative, Crossline (&lt;a href="https://github.com/jcwangxp/Crossline" target="_blank" rel="noopener">https://github.com/jcwangxp/Crossline&lt;/a>). This change affects main/main.cpp and nothing else.&lt;/p>
&lt;p>In addition to replacing the current console with one that supports auto-completion, the plan is to add &amp;ldquo;query&amp;rdquo; capabilities to visualize some of the LiveHD internals.&lt;/p>
&lt;ul>
&lt;li>Query bits, ports&amp;hellip; like
&lt;ul>
&lt;li>&lt;a href="https://github.com/rubund/netlist-analyzer" target="_blank" rel="noopener">https://github.com/rubund/netlist-analyzer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jameswhanlon.com/querying-logical-paths-in-a-verilog-design.html" target="_blank" rel="noopener">https://www.jameswhanlon.com/querying-logical-paths-in-a-verilog-design.html&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>It would be cool if subsections (selected parts) could be visualized with something like &lt;a href="https://github.com/nturley/netlistsvg" target="_blank" rel="noopener">https://github.com/nturley/netlistsvg&lt;/a>&lt;/li>
&lt;li>The shell may be expanded to support simulation in the future&lt;/li>
&lt;li>Wavedrom/Duh dumps&lt;/li>
&lt;/ul>
&lt;p>Wavedrom and duh allow dumping bitfield information for structures. It would be interesting to explore dumping tables and bit fields for Lgraph IOs, and structs/fields inside the module. It may be a way to integrate with the documentation generation.&lt;/p>
&lt;p>Example of queries: show path, show driver/sink of, do topo traversal,&amp;hellip;.&lt;/p>
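&lt;p>As a sketch of what a &amp;ldquo;topo traversal&amp;rdquo; query could look like, here is a Python example over a netlist-like graph. The adjacency-dict representation is hypothetical (not LiveHD&amp;rsquo;s actual API), and the standard-library sorter even flags combinational loops for free:&lt;/p>

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical netlist: each node maps to the set of nodes driving it.
netlist = {
    "out": {"sum"},
    "sum": {"a", "b"},
    "a":   set(),
    "b":   set(),
}

def topo_order(drivers):
    """Return nodes in topological order (drivers before sinks).
    TopologicalSorter raises CycleError on a combinational loop,
    which is exactly the kind of integrity check a query shell
    could expose alongside path and driver/sink queries."""
    return list(TopologicalSorter(drivers).static_order())

order = topo_order(netlist)
```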
&lt;h3 id="compiler-error-generation-pass">Compiler error generation pass&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>Lgraph and LNAST check pass&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Create a pass that check the integrity/correctness of Lgraph and LNAST&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Large 350 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/projects_small.md#medium-diagnostics" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Create a pass that checks that the Lgraph (and/or LNAST) is semantically
correct. The LNAST already has quite a few tests (pass.semantic), but it can be
further expanded. Some checks:&lt;/p>
&lt;ul>
&lt;li>No combinational loops&lt;/li>
&lt;li>No mismatch in bit widths&lt;/li>
&lt;li>No disconnected nodes&lt;/li>
&lt;li>Check for inefficient splits (do not split buses that can be combined)&lt;/li>
&lt;li>Transformation stages should not drop names if the same net is preserved&lt;/li>
&lt;li>No writes in LNAST that are never read&lt;/li>
&lt;li>All edges must be legal; e.g., there is no pin &amp;lsquo;C&amp;rsquo; in a Sum_op&lt;/li>
&lt;/ul></description></item><item><title>Open Source Autonomous Vehicle Controller</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc/</guid><description>&lt;p>The OSAVC is a vehicle-agnostic, open-source hardware and software project. It is designed to provide a real-time hardware controller adaptable to any vehicle type: aerial, terrestrial, marine, or extraterrestrial. It allows control researchers to develop state-estimation algorithms, sensor-calibration algorithms, and vehicle-control models in a modular fashion, such that once the hardware set has been developed, switching algorithms requires only modifying one C function and recompiling.&lt;/p>
&lt;p>Lead mentor: &lt;a href="mailto:aamuhunt@ucsc.edu">Aaron Hunter&lt;/a>&lt;/p>
&lt;p>Projects for the OSAVC:&lt;/p>
&lt;h3 id="vehiclecraft-sensor-driver-development">Vehicle/Craft sensor driver development&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Driver code to integrate sensor to a microcontroller&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C, I2C, SPI, UART interfaces&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong> 175 hours&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong> &lt;a href="mailto:aamuhunt@ucsc.edu">Aaron Hunter&lt;/a>, &lt;a href="mailto:caiespin@ucsc.edu">Carlos Espinosa&lt;/a>, Pavlo Vlastos&lt;/li>
&lt;/ul>
&lt;p>Help develop sensor libraries for use in autonomous vehicles. We are particularly interested in sensors for UAVs: airspeed sensors (pitot tubes) and barometers, but also proximity detectors (ultrasonic) and range sensors. Code will be written in C using a state-machine methodology and non-blocking algorithms. Test the drivers on a Microchip microcontroller.&lt;/p>
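&lt;p>As a sketch of the non-blocking, state-machine style mentioned above, the toy model below illustrates the idea; the real driver would be C on the microcontroller, and the &lt;code>FakeBus&lt;/code> API here is invented purely for illustration.&lt;/p>

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    REQUEST = auto()
    WAIT = auto()
    READ = auto()

class SensorDriver:
    """Non-blocking driver: step() does a small amount of work and
    returns immediately, so the main loop is never stalled on I/O."""
    def __init__(self, bus):
        self.bus = bus
        self.state = State.IDLE
        self.sample = None

    def step(self):
        if self.state == State.IDLE:
            self.state = State.REQUEST
        elif self.state == State.REQUEST:
            self.bus.start_measurement()   # hypothetical bus API
            self.state = State.WAIT
        elif self.state == State.WAIT:
            if self.bus.data_ready():      # poll, never block
                self.state = State.READ
        elif self.state == State.READ:
            self.sample = self.bus.read_value()
            self.state = State.IDLE
        return self.sample

# A fake bus so the sketch runs without hardware.
class FakeBus:
    def __init__(self):
        self.ticks = 0
    def start_measurement(self):
        self.ticks = 0
    def data_ready(self):
        self.ticks += 1
        return self.ticks == 2   # conversion "finishes" after 2 polls
    def read_value(self):
        return 101.3             # e.g. a barometer reading in kPa

driver = SensorDriver(FakeBus())
for _ in range(6):
    driver.step()
print(driver.sample)
```

&lt;p>Because &lt;code>step()&lt;/code> never waits, several such drivers can be interleaved in one superloop, which is the usual pattern on a single-core microcontroller.&lt;/p>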
&lt;h3 id="technical-documentation">Technical Documentation&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Documentation&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Technical writing, markdown language, website&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong> 175 hours&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong> Aaron Hunter/Carlos Espinosa/Pavlo Vlastos&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/aniruddha-thakre/">Aniruddha Thakre&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Write a tutorial demonstrating how to start with an OSAVC and program it with the robotic equivalent of Hello World, moving on to more sophisticated applications. Create a web page interface to the OSAVC repo highlighting this tutorial. In this project you will start from scratch with an OSAVC PCB and bring it to life, documenting the process in a way that helps new users.&lt;/p>
&lt;h3 id="rosgazebo-robot-simulation">ROS/Gazebo Robot Simulation&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Robot simulation with ROS/Gazebo&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong> ROS/Gazebo, Python&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong> 175 or 350 hours&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong> &lt;a href="mailto:aamuhunt@ucsc.edu">Aaron Hunter&lt;/a>, &lt;a href="mailto:caiespin@ucsc.edu">Carlos Espinosa&lt;/a>, Pavlo Vlastos&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/damodar-datta-kancharla/">Damodar Datta Kancharla&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Generate a simulated world and a quadcopter model in ROS/Gazebo. Provide a link from MAVLink to ROS using the mavros package and simulate a real vehicle data stream to command the simulated quadcopter in Gazebo. At the same time, return the image stream from Gazebo to allow offline processing of the images with ML models.&lt;/p></description></item><item><title>Writing a blog about your OSRE 2023 project</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/admin/20221106-admin/</link><pubDate>Sun, 06 Nov 2022 11:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/admin/20221106-admin/</guid><description>&lt;p>Starting in 2023, the Organization Admins will be asking students and contributors to provide regular status updates, which will help us better highlight the work you are doing and track activities within our OSRE projects. These progress reports will also form the basis of blog reports prepared by students in the course of their summer. Blog reports should include links to proposals, presentations, reports, and an overview of the student&amp;rsquo;s experience.&lt;/p>
&lt;p>Your experience is invaluable for future OSRE candidates and for improving the program every year.&lt;/p>
&lt;h2 id="size-and-content">Size and content&lt;/h2>
&lt;p>Keep it short and crisp. Include a short description of your project, a link to your project proposal, and, later in the program, links to the GSoC reports you provided.&lt;/p>
&lt;h2 id="making-a-pull-request-for-your-blog">Making a pull request for your blog&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Fork the &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io" target="_blank" rel="noopener">git repository&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you haven&amp;rsquo;t already done so, add your profile using &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/osredocs/formentors/#instructions-for-adding-a-mentor">these instructions&lt;/a>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>IMPORTANT&lt;/strong>: Under &lt;code>user_groups:&lt;/code> add &lt;code>- 2023 Contributors&lt;/code> (as opposed to either of the two mentor groups)&lt;/li>
&lt;li>The short bio and any other information goes below the frontmatter&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Post your blog&lt;/p>
&lt;ul>
&lt;li>Add &lt;code>/content/report/osre23/ORGANIZATION/PROJECTNAME/DATE-USERNAME/index.md&lt;/code>&lt;/li>
&lt;li>Add a frontmatter to &lt;code>index.md&lt;/code>, using the labels below&lt;/li>
&lt;li>Blog text goes below the frontmatter&lt;/li>
&lt;li>In that same directory include a picture and call it &lt;code>featured.png&lt;/code> (also supports &lt;code>.jpg&lt;/code>, &lt;code>.jpeg&lt;/code>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Commit to your fork, make a pull request, and &lt;a href="mailto:ospo-info-group@ucsc.edu">email the OSRE Admins&lt;/a> (currently: Stephanie Lieggi, Carlos Maltzahn).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="example-frontmatter-and-text-body">Example frontmatter and text body&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">---
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">title: &amp;#34;YOUR TITLE&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">subtitle: &amp;#34;YOUR SUBTITLE (OPTIONAL)&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">summary:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">authors:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - USERNAME1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - USERNAME2
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">tags: [&amp;#34;osre23&amp;#34;]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">categories: []
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">date: YYYY-MM-DD
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">lastmod: YYYY-MM-DD
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">featured: false
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">draft: false
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># Featured image
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># To use, add an image named `featured.jpg/png` to your page&amp;#39;s folder.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">image:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> caption: &amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> focal_point: &amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> preview_only: false
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">---
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">As part of the [PROJECTNAME](/project/osre23/ORGANIZATION/PROJECTNAME) my [proposal](https://...) under the mentorship of MENTOR aims to ...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item><item><title>Efficient Communication with Key/Value Storage Devices</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore/</link><pubDate>Sun, 27 Feb 2022 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore/</guid><description>&lt;p>Network key value stores are used throughout the cloud as a storage backends (eg AWS ShardStore) and are showing up in devices (eg NVMe KV SSD). The KV clients use traditional network sockets and POSIX APIs to communicate with the KV store. An advancement that has occurred in the last 2 years is a new kernel interface that can be used in lieu of the POSIX API, namely &lt;code>io_uring&lt;/code>. This new interface uses a set of shared memory queues to provide for kernel-to-user communication and permits zero copy transfer of data. This scheme avoids the overhead of system calls and can improve performance.&lt;/p>
&lt;h3 id="implement-io_uring-communication-backend">Implement &lt;code>io_uring&lt;/code> communication backend&lt;/h3>
&lt;p>&lt;strong>Topics:&lt;/strong> performance, I/O, network, key-value, storage&lt;br>
&lt;strong>Difficulty:&lt;/strong> Medium&lt;br>
&lt;strong>Size:&lt;/strong> Medium or large (120 or 150 hours)&lt;br>
&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:philip.kufeldt@seagate.com">Philip Kufeldt (Seagate)&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/aldrin-montana/">Aldrin Montana&lt;/a> (UC Santa Cruz)&lt;br>
&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/manank-patel/">Manank Patel&lt;/a>&lt;/p>
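&lt;p>The shared-queue idea behind &lt;code>io_uring&lt;/code> can be sketched abstractly: the application batches requests into a submission queue, the kernel drains it and posts results to a completion queue, so a single kernel transition can cover many operations. The toy model below uses plain Python queues to stand in for the shared-memory rings; it is a conceptual sketch, not the liburing API, and the key-value operations are invented for illustration.&lt;/p>

```python
from collections import deque

sq = deque()   # submission queue (application side fills it)
cq = deque()   # completion queue (kernel side fills it)

def submit(op, key):
    """Application side: enqueue a request with no syscall at all."""
    sq.append((op, key))

def kernel_enter(store):
    """Models a single io_uring_enter()-style transition: drain every
    pending submission and post one completion per request."""
    syscalls = 1
    while sq:
        op, key = sq.popleft()
        if op == "get":
            cq.append(store.get(key))
        elif op == "put":
            store[key] = "value-for-" + key
            cq.append("ok")
    return syscalls

store = {}
for k in ("a", "b", "c"):
    submit("put", k)
submit("get", "b")
ncalls = kernel_enter(store)
print(ncalls, list(cq))
```

&lt;p>The point of the model is the ratio: four operations complete under one kernel transition, where a per-call socket/POSIX design would pay at least one syscall per operation.&lt;/p>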
&lt;p>Seagate has been using a network-based KV HDD as a research vehicle for computational storage. This research vehicle uses an open-source user library that implements a KV API by sending protobuf-based RPCs to a network KV store. Currently it is implemented with the standard socket and POSIX APIs to communicate with the KV backend. This project would implement an &lt;code>io_uring&lt;/code> communication backend and compare the results of both implementations.&lt;/p></description></item><item><title>DirtViz 2.0 (2023)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/dirtviz/</link><pubDate>Mon, 07 Feb 2022 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/dirtviz/</guid><description>&lt;p>DirtViz is a project to visualize data collected from sensors deployed in sensor networks. We have deployed a number of sensors measuring quantities such as soil moisture, temperature, current, and voltage in outdoor settings. This project involves extending our existing visualization stack, DirtViz 1.0 (see GitHub), and expanding it to version 2.0. The project goal is to create a fully-fledged dataviz tool tailored to the types of data collected from embedded-systems sensor networks.&lt;/p>
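&lt;p>One core interaction for such a tool is zooming a sensor time series to a date range. A minimal sketch with synthetic data (the samples and function names are invented for illustration; DirtViz&amp;rsquo;s actual stack may differ):&lt;/p>

```python
from bisect import bisect_left, bisect_right
from datetime import date, timedelta

# Synthetic daily soil-moisture samples (values are made up).
start = date(2023, 6, 1)
times = [start + timedelta(days=i) for i in range(10)]
values = [20.0 + i for i in range(10)]

def zoom(lo, hi):
    """Zoom to the closed date range [lo, hi] using binary search on the
    sorted timestamps, so re-rendering stays cheap for long deployments."""
    i = bisect_left(times, lo)
    j = bisect_right(times, hi)
    return times[i:j], values[i:j]

ts, vs = zoom(date(2023, 6, 3), date(2023, 6, 5))
print(len(ts), vs)
```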
&lt;h3 id="visualize-sensor-data">Visualize Sensor Data&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Data Visualization, Analytics&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> javascript, python, bash, webservers, git, embedded systems&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy/Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large, 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:sonaderi@ucsc.edu">Sonia Naderi&lt;/a>, &lt;a href="mailto:sgtaylor@ucsc.edu">Stephen Taylor&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Refine our web-based visualization tools to easily allow users to zoom in on date ranges, change axes, etc.&lt;/li>
&lt;li>Create a system for remote collaborators/citizen scientists to upload their own data in a secure manner&lt;/li>
&lt;li>Craft an intuitive navigation system so that data from deployment sites around the world can be easily viewed&lt;/li>
&lt;li>Document the tool thoroughly for future maintenance&lt;/li>
&lt;li>If interested, we are also open to you investigating correlations between different data streams and doing self-directed data analysis&lt;/li>
&lt;/ul></description></item></channel></rss>