<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>GSoC'23 | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/gsoc23/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/category/gsoc23/index.xml" rel="self" type="application/rss+xml"/><description>GSoC'23</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Mon, 25 Sep 2023 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>GSoC'23</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/gsoc23/</link></image><item><title>Final GSoC Blog - Polyglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230925-kirandeol/</link><pubDate>Mon, 25 Sep 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230925-kirandeol/</guid><description>&lt;p>As I send in my final work submission for the final GSoC evaluation, I&amp;rsquo;m excited to share with you the progress we&amp;rsquo;ve made this summer (and future plans for Polyglot!). You can view the repository and web app here: &lt;a href="https://polyphyhub.github.io/PolyGlot/" target="_blank" rel="noopener">https://polyphyhub.github.io/PolyGlot/&lt;/a>. As a quick reminder of the project, we sought to extend the Polyglot web app, as developed by Hongwei (Henry) Zhou. For context, the web app follows this methodology:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Given a set of words, use an embedding model (such as Word2Vec, BERT, etc.) to generate a set of high dimensional points associated with each word.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use a dimensionality reduction method (such as UMAP) to reduce the dimensionality of each word-vector point to 3 dimensions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the novel MCPM (Monte Carlo Physarum Machine) to compute the similarities between a set of anchor points and the rest of the point cloud. You could use any similarity metric here, too, such as the Euclidean distance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The web app then displays the point cloud of 3-dimensional embeddings, but uses coloring to indicate the level of MCPM similarity each word has with the anchor point (e.g., if the anchor point is the word “dog”, the rest of the point cloud is colored such that words identified as similar to “dog” by the MCPM metric are brighter, whereas dissimilar words are darker).&lt;/p>
&lt;/li>
&lt;/ol>
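&lt;p>To make those four steps concrete, here is a minimal Python sketch of the pipeline. It uses made-up embeddings, a truncated SVD as a stand-in for UMAP, and Euclidean distance as a stand-in for the MCPM metric, so it only illustrates the shape of the computation, not the real app:&lt;/p>

```python
import numpy as np

# Step 1: toy vocabulary with random 50-d "embeddings" (stand-in for Word2Vec/BERT).
rng = np.random.default_rng(0)
words = ["dog", "cat", "fedora", "year"]
embeddings = rng.normal(size=(len(words), 50))

# Step 2: reduce to 3 dimensions. A truncated SVD stands in for UMAP here.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
points_3d = centered @ vt[:3].T              # shape (n_words, 3)

# Step 3: similarity to an anchor point. Euclidean distance stands in
# for MCPM, which the post notes is also a valid choice.
anchor = points_3d[words.index("dog")]
dist = np.linalg.norm(points_3d - anchor, axis=1)

# Step 4: map similarity to brightness in [0, 1] (closer = brighter).
brightness = 1.0 - dist / dist.max()
```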
&lt;p>The main results since the last blog are summarized as follows:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>A novel timeline feature in which users can track the importance of certain words over time by watching the points change size (it computes the TF-IDF metric for a word across all documents in a given year). It uses linear interpolation for years that do not have an explicit importance score.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An industrial collaboration with UK startup Lautonomy, where we have pre-processed and entered their data into Polyglot. Pre-processing consisted of first computing a high-dimensional embedding of their set of words using OpenAI&amp;rsquo;s CLIP model &lt;a href="https://openai.com/research/clip" target="_blank" rel="noopener">https://openai.com/research/clip&lt;/a> and the CLIP-as-service Python package &lt;a href="https://clip-as-service.jina.ai" target="_blank" rel="noopener">https://clip-as-service.jina.ai&lt;/a>. Next, we used UMAP to reduce the dimensionality of these embeddings to 3D. We computed the Euclidean distance on this data (in place of the MCPM metric). Finally, we formatted the data for loading into Polyglot.&lt;/p>
&lt;/li>
&lt;/ol>
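&lt;p>For the timeline feature, the interpolation step can be sketched as follows (the years and scores here are invented for illustration; Polyglot&amp;rsquo;s actual data format may differ):&lt;/p>

```python
import numpy as np

# Hypothetical TF-IDF importance scores for one word, known only for some years.
known_years = np.array([2015, 2018, 2020])
known_scores = np.array([0.10, 0.40, 0.20])

# Linearly interpolate the missing years, as the timeline feature does
# for years without an explicit importance score.
all_years = np.arange(2015, 2021)
scores = np.interp(all_years, known_years, known_scores)

# Point sizes then scale with the (interpolated) importance.
point_sizes = 5.0 + 20.0 * scores
```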
&lt;p>Although the app has developed a lot over the summer, we are planning to continue working on Polyglot, particularly with respect to one of our original goals: to set up a pipeline from PolyPhy to Polyglot. Unfortunately, with PolyPhy undergoing refactoring this summer, we weren&amp;rsquo;t able to set this pipeline up. However, that is one of our goals for the next few months. We are also moving forward with the industrial collaboration with legal analytics startup Lautonomy. We hope to release an output together soon!&lt;/p>
&lt;p>If you&amp;rsquo;re curious about Polyglot or are interested in getting involved, please feel free to reach out to me, Oskar Elek, and Jasmine Otto!&lt;/p></description></item><item><title>KV store final Blog</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230825-manank/</link><pubDate>Fri, 25 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230825-manank/</guid><description>&lt;p>Hello again!
Before we get started, take a look at my previous blogs, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank">Introduction&lt;/a> and
&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank">Mid Term&lt;/a>. The goal of the project was to implement io_uring based backend driver for client side, which was at
that time using traditional sockets. The objective was improving performance from the zero copy capabilities of io uring. In the process, I learnt about many things,
about &lt;a href="https://gitlab.com/kinetic-storage/libkinetic/-/tree/develop" target="_blank" rel="noopener">libkinetic&lt;/a> and KV stores in general.&lt;/p>
&lt;p>I started by writing a separate driver using io_uring in libkinetic/src in ktli_uring.c, most of which is similar to the sockets backend in ktli_socket.c. The only difference was in the send and receive functions. For a more detailed description of the implementation, refer to the mid-term blog.&lt;/p>
&lt;p>After the implementation, it was time to put it to the test. We ran extensive benchmarks with a tool called &lt;a href="https://fio.readthedocs.io/en/latest/fio_doc.html" target="_blank" rel="noopener">fio&lt;/a>, which is generally used to test filesystems and other IO-related things. Thanks to Philip, who had already written an IO engine for testing the kinetic KV store (&lt;a href="https://github.com/pkufeldt/fio" target="_blank" rel="noopener">link&lt;/a>), I didn&amp;rsquo;t have much trouble setting up the testbench. Philip also set up an Ubuntu server running the kinetic server and gave me access over SSH. We ran extensive tests on that server, with both the socket and uring backends and several different block sizes. The benchmarks sheet can be found &lt;a href="https://docs.google.com/spreadsheets/d/1HE7-KbxSqYZ3vmTZiJYoq21P7zfymU7N/edit?usp=sharing&amp;amp;ouid=116274960434137108384&amp;amp;rtpof=true&amp;amp;sd=true" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>We spent a lot of time reading and discussing the numbers, probably the most time-consuming part of the project. We had several long discussions analyzing the numbers and their implications. For example, in the initial tests we were getting a very high standard deviation in mean send times; we then figured out it was because of a network bottleneck, as we were using large block sizes and quickly filling up the 2.5G network bandwidth.&lt;/p>
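&lt;p>A quick back-of-the-envelope calculation (illustrative arithmetic only, not taken from the benchmark sheet) shows why large blocks saturate a 2.5G link: the wire time per block grows linearly with block size and quickly dominates any per-call overhead.&lt;/p>

```python
# Time for one block to cross a 2.5 Gbit/s link, ignoring protocol overhead.
LINK_BPS = 2.5e9  # 2.5 Gbit/s

def wire_time_us(block_bytes):
    """Microseconds needed to push one block onto the wire."""
    return block_bytes * 8 / LINK_BPS * 1e6

for bs_kib in (16, 512, 1024):
    print(f"{bs_kib:>5} KiB -> {wire_time_us(bs_kib * 1024):8.1f} us")
```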
&lt;p>In conclusion, we found that there are many other major factors affecting the performance of the KV store, for example the network and the server side of the KV store. Thus, although io_uring offers a performance benefit at the userspace-kernel level, in this case other factors had a more significant effect than the kernel IO stack on the client side. To increase performance, we need to look at the server side.&lt;/p>
&lt;p>I would like to thank Philip and Aldrin for their unwavering support and in-depth discussions in our weekly meetings. I learned a lot from them
throughout the entire duration of the project.&lt;/p></description></item><item><title>Midpoint Blog Interactive Exploration of High-dimensional Datasets with PolyPhy and Polyglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230803-kirandeol/</link><pubDate>Thu, 03 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230803-kirandeol/</guid><description>&lt;p>The last few months of my GSoC project have been very exciting and I hope to share why with you here in this blog post! To briefly summarize, my project has been focused on further developing the Polyglot app, a tool for visualizing 3D language embeddings. One important part of Polyglot is its utilization of the novel MCPM metric, where points are colored according to their MCPM similarity to a user-chosen “anchor point” (e.g., if “hat” is our anchor point, then similar words like “cap” or “fedora” will be colored more prominently).&lt;/p>
&lt;p>The first issue we wanted to tackle was actually navigating the point cloud. With hundreds of thousands of points, it can be difficult to find what you’re looking for! Thus, the first few features added were a search bar for points and anchor points and a “jump to point” feature which changes a user’s center of rotation and “jumps” to a chosen point. There were a few hiccups with implementing these features, mainly due to the large number of points and the particular quirks of the graphics library Polyglot uses. In the end though, these simple features made it feel a lot easier to use Polyglot.&lt;/p>
&lt;p>The next set of features related to our desire to actually annotate the point cloud. Much like how one might annotate a Google Doc (i.e., highlight a chunk of text and leave a comment), we wanted to set up something similar, but with points! Indeed, this led to the development of a cool brush tool for coloring points, named and commented annotations (up to 5), a search bar within annotations, and finally a button to export annotations and comments to a CSV.&lt;/p>
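&lt;p>The export step boils down to writing the annotation records out as CSV rows. A minimal sketch (the field names here are illustrative, not Polyglot&amp;rsquo;s actual schema):&lt;/p>

```python
import csv
import io

# Hypothetical annotation records: a name, a comment, and the point ids covered.
annotations = [
    {"name": "sports cluster", "comment": "ball games", "points": "12;45;78"},
    {"name": "years branch", "comment": "mostly years", "points": "3;4"},
]

buf = io.StringIO()  # in a browser app this would become a downloadable file
writer = csv.DictWriter(buf, fieldnames=["name", "comment", "points"])
writer.writeheader()
writer.writerows(annotations)
csv_text = buf.getvalue()
```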
&lt;p>The next few weeks are looking bright as we strive to finish the PolyPhy-Polyglot pipeline (a notebook for quickly formatting MCPM data from PolyPhy and getting it into Polyglot). We also hope to add a unique “timeline” feature in which users can analyze sections of the point cloud based on the associated time of each point. Overall, it’s been a very stimulating summer and I’m excited to push this project even further!&lt;/p></description></item><item><title>Midterm: High Fidelity UAV Simulation Using Unreal Engine with specular reflections</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230802-damodardatta/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230802-damodardatta/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc">Open Source Autonomous Vehicle Controller&lt;/a> my &lt;a href="https://drive.google.com/file/d/18g-WRZj_7ufIt6YZNn4OG1s7VKi1u5hV/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Aaron Hunter and Carlos Espinosa&lt;/strong> aims to develop an Unreal Engine-based simulator for testing. The simulator will use Unreal Engine for physics and visualization.&lt;/p>
&lt;h2 id="what-we-have-done-so-far">What we have done so far&lt;/h2>
&lt;ul>
&lt;li>We found that we can use Unreal Engine as a physics simulator and co-simulate with Simulink using the tools provided by MathWorks.&lt;/li>
&lt;li>Simulated an example provided by MathWorks, but I wasn&amp;rsquo;t getting the expected behaviour and there were very few resources available.&lt;/li>
&lt;li>So we decided to use Gazebo and ROS for simulation, instead of Unreal Engine and Simulink, for the example of a balancing bot that had been designed in SolidWorks.&lt;/li>
&lt;li>To use Gazebo, I converted the SolidWorks model into a URDF and imported it into Gazebo.&lt;/li>
&lt;/ul>
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;p>Currently, I am working on using Gazebo and ROS to control a balancing bot with a PID control algorithm. Afterwards, I will document the process of importing a model into Gazebo for testing a control algorithm.&lt;/p></description></item><item><title>Implemented IO uring for Key-Value Drives</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/</guid><description>&lt;p>Hi everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Manank Patel (&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank">link&lt;/a> to my Introduction post), and I am currently working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore">Efficient Communication with Key/Value Storage Devices&lt;/a>. The goal of the project was to leverage the capabilities of io_uring and implement a new backend driver.&lt;/p>
&lt;p>In the existing sockets backend, we use non-blocking sockets with looping to ensure all the data is written. Here is a simplified flow diagram for the same. The reasoning behind using non-blocking sockets and TCP_NODELAY is to get proper network utilization. This snippet from the code explains it further.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">NODELAY means that segments are always sent as soon as possible,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">even if there is only a small amount of data. When not set,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">data is buffered until there is a sufficient amount to send out,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">thereby avoiding the frequent sending of small packets, which
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">results in poor utilization of the network. This option is
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">overridden by TCP_CORK; however, setting this option forces
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">an explicit flush of pending output, even if TCP_CORK is
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">currently set.
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sockets flow" srcset="
/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_fe3f3d8030752b92e5fb87ea1d67e0c2.webp 400w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_44c789c0dc2dbae770c40595d35ae941.webp 760w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_fe3f3d8030752b92e5fb87ea1d67e0c2.webp"
width="469"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
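&lt;p>Setting the option itself is a one-liner. Here is a minimal Python sketch (the actual backend is written in C, but the socket options are the same):&lt;/p>

```python
import socket

# Create a TCP socket configured like the ktli sockets backend:
# non-blocking, with Nagle's algorithm disabled via TCP_NODELAY.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
s.setblocking(False)

nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
s.close()
```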
&lt;p>In the above figure, we have a &lt;a href="https://gitlab.com/kinetic-storage/libkinetic/-/blob/manank/src/ktli_socket.c?ref_type=heads#L436" target="_blank" rel="noopener">loop&lt;/a> with a writev call. We check the return value: if not all the data has been written, we modify the offsets and loop again; otherwise, we exit the loop and return from the function. This works well with traditional sockets, as we get the return value from the writev call as soon as it returns. With io_uring, if we try to follow the same design, we get the following flow diagram.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="uring flow" srcset="
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_eaf262f65651ce613bf0a033f897afde.webp 400w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_bc898fc227145dff9464f87e8f66363f.webp 760w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_eaf262f65651ce613bf0a033f897afde.webp"
width="417"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Here, as you can see, there is a lot of additional overhead if we want to check the return value before sending the next writev, as we need to know how many bytes have been written so far in order to adjust the offsets and issue the next request accordingly. Thus, in every iteration of the loop we need to get an SQE, prep it for writev, submit it, and then wait for the CQE to get the return value of the writev call.&lt;/p>
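&lt;p>For comparison, the sockets-style partial-write loop is easy to express: keep sending and advancing an offset until the whole buffer is out. A simplified Python sketch (plain send() instead of writev over iovecs):&lt;/p>

```python
import socket

def send_all(sock, payload):
    """Sockets-backend style loop: retry until every byte is written,
    advancing the offset after each partial write."""
    view = memoryview(payload)
    total = 0
    while total < len(payload):
        try:
            total += sock.send(view[total:])
        except BlockingIOError:
            continue  # non-blocking socket not ready yet; just retry
    return total

# Usage over a local socket pair standing in for the kinetic server.
a, b = socket.socketpair()
sent = send_all(a, b"x" * 4096)
a.close()
received = b.recv(8192)
b.close()
```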
&lt;p>The alternative approach would be to write the full message/iovec atomically in one call, as shown in the following diagram.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="possible uring flow" srcset="
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_df20a0788e55e56bf7af70d91c7275c6.webp 400w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_056949985d6ef71540ba0c4992f11376.webp 760w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_df20a0788e55e56bf7af70d91c7275c6.webp"
width="535"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>However, when we tried this method and ran fio tests, we noticed that it worked well with smaller block sizes like 16k, 32k and 64k, but failed consistently with larger block sizes like 512k or 1m. This was because it was not able to write all the data to the socket in one go. For the small block sizes, this method showed good results compared to the sockets backend. We tried increasing the send/recv buffers to 1MiB-10MiB, but it still struggled with larger block sizes.&lt;/p>
&lt;p>Going forward, we discussed a few ideas to understand the performance trade-offs. One is to use a static variable and increment it on every loop iteration; that way we can find out whether this is really the contributing factor to our problem. Another idea is to break the message down into small chunks, say 256k, set up io_uring with SQE polling, and then link and submit those requests in a loop, without calling io_uring_submit and waiting for a CQE. The plan is to try these ideas, discuss, and
come up with new ideas on how we can leverage io_uring for ktli backend.&lt;/p></description></item><item><title>PolyPhy Infrastructure Enhancement</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230727-prashantjha/</link><pubDate>Thu, 27 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230727-prashantjha/</guid><description>&lt;p>As part of the Polyphy Project, my proposal was aimed at improving various aspects of the project, including CI/CD workflows, encapsulation, and security. Under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, I have made significant progress in the following areas:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Fixed GitHub CI Workflows and Release to PyPI:&lt;/strong>
During the first phase, I focused on refining the GitHub CI workflows by implementing new flows that facilitate seamless releases to PyPI. This ensures that the project can be easily distributed and installed by users, making it more accessible and user-friendly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Encapsulation from Jupyter into Module:&lt;/strong>
I successfully encapsulated the code from Jupyter notebooks into a module. This step is crucial as it prepares the codebase to be released as a standalone module, making it easier for developers to use and integrate into their own projects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SonarCloud Integration for Better Code Analysis:&lt;/strong>
To ensure the codebase&amp;rsquo;s quality, I set up SonarCloud to perform comprehensive code analysis. This helps in identifying potential issues, bugs, and areas of improvement, leading to a more robust and reliable project.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Migration to Docker from Tox:&lt;/strong>
In order to improve the containerization process, I replaced the existing solution, Tox, with Docker. Docker provides better container management and ensures a consistent development and deployment environment across different platforms.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Research on Community Platforms for Self-Hosting:&lt;/strong>
I conducted extensive research on various community platforms suitable for self-hosting. This will enable the project to establish a thriving community and foster active collaboration among users and contributors.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Enhanced Security Measures:&lt;/strong>
I implemented several security improvements to safeguard the project and its users. These include setting up a comprehensive security policy, implementing secret scanning to prevent unintentional exposure of sensitive information, code scanning to identify potential vulnerabilities, private vulnerability reporting to handle security issues responsibly, and Dependabot integration for monitoring and managing dependencies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Upgraded Taichi to Utilize Class-Based Features:&lt;/strong>
As part of the project&amp;rsquo;s development, I upgraded our use of Taichi to take advantage of its class-based features, thereby enhancing the codebase&amp;rsquo;s organization and maintainability.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Moving forward, I plan to continue working diligently to achieve the goals outlined in my proposal. The improvements made during the first half of the GSoC program have laid a strong foundation for the project&amp;rsquo;s growth and success.&lt;/p>
&lt;p>Stay tuned for further updates and exciting developments as the project progresses!&lt;/p></description></item><item><title>Proactive Data Containers</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/lbl/pdc/20230620-nijwang/</link><pubDate>Tue, 20 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/lbl/pdc/20230620-nijwang/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/pdc">Proactive Data Containers (PDC)&lt;/a> my &lt;a href="https://docs.google.com/document/d/1Pnt-iq9pWD70d_jmSsoJjnbXtIjJGY3IbXFrwyFT4Q4/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/houjun-tang/">Houjun Tang&lt;/a> aims to provide a novel data abstraction for managing science data in an object-oriented manner. PDCs will provide efficient strategies for moving data in deep storage hierarchies and techniques for transforming and reorganizing data based on application requirements. The functionality of the container objects themselves is already well developed, so my goal will be to verify the functionality tests for the Python API to ensure that it can be used with ease, as well as create command line tools so that PDC is a complete data object that can be used across platforms and is simple and helpful for its users.&lt;/p></description></item><item><title>Interactive Exploration of High-dimensional Datasets with PolyPhy and Polyglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230616-kirandeol/</link><pubDate>Fri, 16 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230616-kirandeol/</guid><description>&lt;p>Hello!
My name is Kiran and this summer I&amp;rsquo;ll be working with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy">Polyphy&lt;/a> and &lt;a href="https://normand-1024.github.io/Bio-inspired-Exploration-of-Language-Embedding/" target="_blank" rel="noopener">Polyglot&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>.
The full &lt;a href="https://drive.google.com/file/d/1iwKU938uzUHn0oY2tM0jPADOYoF0kqbh/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> is available online.&lt;/p>
&lt;p>For a brief overview, the Polyglot app allows users to interact with a 3D network of high-dimensional language embeddings, specifically the
&lt;a href="http://vectors.nlpl.eu/repository/" target="_blank" rel="noopener">Gensim Continuous Skipgram result of Wikipedia Dump of February 2017 (296630 words)&lt;/a> dataset. The high-dimensional
embeddings are reduced to 3 dimensions using UMAP. The novel &lt;a href="https://iopscience.iop.org/article/10.3847/2041-8213/ab700c/pdf" target="_blank" rel="noopener">MCPM slime mold metric&lt;/a> is then used
to compute the similarity levels between points (much like how you might compute the Euclidean distance between two points). These similarity levels are used
to filter the network and enable users to find interesting patterns in their data they might not find using quantitative methods alone. For example, the network has
a distinct branch in which only years are nearby! Users might find other clusters, such as ones with sports words or even software engineering words.
Although such exploration may not lead to quantitatively significant conclusions on its own, the ability to explore and test mini hypotheses about the data can lead to
important insights that later inform quantitatively significant conclusions.&lt;/p>
&lt;p>In our project, we aim to expand Polyglot such that any user can upload their own data, once they have computed the MCPM metric using PolyPhy. This will have
important applications in building trust in our data and embeddings. This could also help with research on the MCPM metric, which presents a new, more naturalistic
way of computing similarity by relying on the principle of least effort. Overall, there is an exciting summer ahead and if you&amp;rsquo;re interested in keeping up please
feel free to check out the Polyglot app on Github!&lt;/p></description></item><item><title>Enhancing and Validating LiveHD's Power Modeling Flow</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/livehd/20230529-shahzaibk23/</link><pubDate>Mon, 29 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/livehd/20230529-shahzaibk23/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd">Enhancing and Validating LiveHD&amp;rsquo;s Power Modeling Flow&lt;/a> my &lt;a href="https://docs.google.com/document/d/1_GtzWf_gCKkreN1-6VSAI4h2BqwKEUDGkNNB1OM554I/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of Jose Renau and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a> aims to enhance and validate LiveHD&amp;rsquo;s power modeling flow, a critical feature for estimating power consumption in modern hardware designs. The existing flow requires further refinement to ensure its stability, accuracy, compatibility with a wider range of netlists and VCD files, and overall performance. To address these challenges, the project will focus on methodically debugging the current implementation, establishing a comprehensive validation methodology for verifying the accuracy of power estimates, and optimizing the flow to handle larger netlists and VCD files efficiently. Additionally, the project aims to improve existing documentation by providing detailed explanations, examples, and tutorials to facilitate user adoption and understanding. Upon successful completion, the project will deliver a more reliable, accurate, and efficient power modeling flow within LiveHD, contributing to the development of energy-efficient hardware designs. 
This refined flow will not only enhance the capabilities of LiveHD but also encourage wider adoption and utilization by the hardware design community, fostering innovation in the field of energy-efficient devices and systems.&lt;/p></description></item><item><title>High Fidelity UAV Simulation Using Unreal Engine with specular reflections</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230601-damodardatta/</link><pubDate>Mon, 29 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230601-damodardatta/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc">Open Source Autonomous Vehicle Controller&lt;/a> my &lt;a href="https://drive.google.com/file/d/18g-WRZj_7ufIt6YZNn4OG1s7VKi1u5hV/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Aaron Hunter and Carlos Espinosa&lt;/strong> aims to develop an Unreal Engine-based simulator for testing. The simulator will use Unreal Engine for physics and visualization.&lt;/p>
&lt;p>The existing framework uses the Gazebo simulator with ROS, which limits development to the Python and C++ programming languages. I intend to develop this simulator with the intention of connecting it with Python and C++, additionally expanding support to MATLAB so that in the future the control algorithm design and validation process becomes easier. To smooth future development, I intend to add detailed documentation consisting of weekly reports from the development period, examples, and tutorials. Upon successful completion, the project will deliver a powerful simulator with realistic simulation using Unreal Engine and additional support for other programming languages like MATLAB.&lt;/p>
&lt;p>For more information about the Open Source Autonomous Vehicle Controller and the UC OSPO organization, you can visit the &lt;a href="https://github.com/uccross/open-source-autonomous-vehicle-controller" target="_blank" rel="noopener">OSAVC project repository&lt;/a> and the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/">UC OSPO website&lt;/a>.&lt;/p></description></item><item><title>OpenRAM Layout versus Schematic (LVS) visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/openram/20230529-mahnoor-ismail01/</link><pubDate>Mon, 29 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/openram/20230529-mahnoor-ismail01/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/openram">OpenRAM Layout versus Schematic (LVS) visualization&lt;/a> my &lt;a href="https://docs.google.com/document/d/1QEBOglVgy20s0v1_vfpFHw8CdIYUbex12TOjSlAe1-E/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jesse-cirimelli-low/">Jesse Cirimelli-Low&lt;/a> and &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a> aims to develop a comprehensive Python-based graphical user interface (GUI) with a robust backend system to effectively analyze, visualize, and debug layout versus schematic (LVS) mismatches in the OpenRAM framework. The proposed solution focuses on efficiently processing LVS report files in JSON format, identifying mismatched nets in the layout, and visually representing extra nets in the schematic graph using advanced backend algorithms. By implementing a powerful backend system, the GUI will streamline the debugging process and improve overall productivity, while maintaining high performance and reliability.
The deliverables for this project include a fully-functional GUI with a performant backend, features for visualizing and navigating through LVS mismatches, comprehensive documentation, and user guides.&lt;/p></description></item><item><title>Efficient Communication with Key/Value Storage Devices</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank/</link><pubDate>Fri, 26 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank/</guid><description>&lt;p>Hi everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Manank Patel, currently an undergraduate student at Birla Institute of Technology and Science - Pilani, KK Birla Goa Campus. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore">Efficient Communication with Key/Value Storage Devices&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1iJIlHuCpnvDeOyr5DphDDimqdl9s4hKH/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/aldrin-montana/">Aldrin Montana&lt;/a> and &lt;strong>Philip Kufeldt&lt;/strong>, aims to implement an io_uring-based communication backend for a network-based key/value store.&lt;/p>
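&lt;p>For context, here is a minimal sketch of the traditional blocking-socket request/response path that such an io_uring backend would replace. The length-prefixed wire format and the function names are invented for illustration; they are not the project&amp;rsquo;s actual protocol:&lt;/p>

```python
# Sketch of a blocking-socket KV exchange (hypothetical wire format):
# every send and receive below is one or more system calls, which is the
# per-operation overhead io_uring's shared queues are designed to avoid.
import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # Length-prefixed frame: 4-byte big-endian size, then the payload.
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    # Each recv() here is a separate system call into the kernel.
    data = b""
    while len(data) != n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return data

def recv_msg(sock: socket.socket) -> bytes:
    (size,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, size)

if __name__ == "__main__":
    client, server = socket.socketpair()
    send_msg(client, b"GET mykey")         # client issues a request
    key = recv_msg(server).split()[1]      # "server" parses the key
    send_msg(server, b"VALUE " + key)      # and replies
    print(recv_msg(client).decode())       # prints: VALUE mykey
```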
&lt;p>The KV store clients currently use traditional network sockets and POSIX APIs to communicate with the KV store. A notable advancement of the past two years is io_uring, a new kernel interface that can be used instead of the POSIX API. It employs shared memory queues for communication between the kernel and user space, enabling data transfer without per-operation system calls and supporting zero-copy transmission. By circumventing the overhead of system calls, this approach has the potential to enhance performance significantly.&lt;/p>
&lt;p>My name is Luiza, and I am an eighth-semester BSc Biological Sciences student from São Paulo, Brazil. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/labop">LabOp&lt;/a> working group, my &lt;a href="https://docs.google.com/document/d/1pJ7UIATZYASXjbLdUosvq08QkhPNTFxZFId9dapNp-o/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dan-bryce/">Dan Bryce&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tim-fallon/">Tim Fallon&lt;/a>, aims to build a converter that takes ordinary laboratory protocols and translates them into machine-executable protocols. This is possible thanks to LabOP&amp;rsquo;s versatile representation of what a laboratory protocol should look like. I&amp;rsquo;ll be testing this specialization on Hamilton machines, which are well suited to scaling experiments up.&lt;/p>
&lt;p>Today, biotechnology laboratories face a common issue: protocols are difficult to share and to adapt for machine execution. Laboratory protocols are critical to biological research and development, yet complicated to communicate and reproduce across projects, investigators, and organizations. While many attempts have been made to address this challenge, there is currently no available protocol representation that is unambiguous enough for precise interpretation and automation, yet simultaneously abstract enough to enable reuse and adaptation.&lt;/p>
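&lt;p>As a toy illustration of that idea, a single abstract protocol step can be rendered either for a human at the bench or for a machine&amp;rsquo;s job queue. Everything below is invented for illustration; it is not the actual LabOP API:&lt;/p>

```python
# Hypothetical sketch of "protocol as data": one abstract step,
# multiple target-specific renderings (not the real LabOP API).
from dataclasses import dataclass

@dataclass
class TransferStep:
    source: str
    destination: str
    volume_ul: float

def to_human_text(step: TransferStep) -> str:
    # Unambiguous rendering for a person at the bench.
    return f"Transfer {step.volume_ul} uL from {step.source} to {step.destination}"

def to_machine_command(step: TransferStep) -> dict:
    # Rendering for a liquid-handling robot's job queue
    # (field names here are invented for illustration).
    return {"op": "transfer", "src": step.source,
            "dst": step.destination, "vol_ul": step.volume_ul}

step = TransferStep("plate1:A1", "plate2:B3", 50.0)
print(to_human_text(step))       # prints: Transfer 50.0 uL from plate1:A1 to plate2:B3
print(to_machine_command(step))
```

A real specialization would target a specific executor (such as a Hamilton machine) instead of the toy dictionary above, but the shape of the translation is the same.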
&lt;p>With LabOP, we can convert a protocol in multiple ways depending on the researcher&amp;rsquo;s needs for automation or human execution, allowing flexibility in how experiments are run. I&amp;rsquo;ll be building a specialization that translates protocols so that they can be executed by Hamilton machines.&lt;/p></description></item><item><title>PolyPhy Infrastructure Enhancement</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230525-prashantjha/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230525-prashantjha/</guid><description>&lt;p>Hey!&lt;/p>
&lt;p>I&amp;rsquo;m Prashant Jha from Pune, a recent undergraduate of BITS Pilani. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy">Polyphy&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1y2X1_6_HliYowZn-qHd7x_Hz6QC3-KSe/view" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, aims to develop and improve its current infrastructure.&lt;/p>
&lt;p>PolyPhy, which is led by
Oskar Elek, is an organization that focuses on developing a GPU-oriented
agent-based system for reconstructing and visualizing optimal transport networks
defined over sparse data. With its roots in astronomy and inspiration drawn from nature,
PolyPhy has been instrumental in discovering network-like patterns in natural language
data and reconstructing the Cosmic web structure using its early prototype called
Polyphorm. The organization aims to provide a richer 2D / 3D scalar field representation
of the reconstructed network, making it a toolkit for a range of specialists across
different disciplines, including astronomers, neuroscientists, data scientists, and artists.
PolyPhy&amp;rsquo;s ultimate purpose is to create quantitatively comparable structural analytics
and discover connections between different disciplines. To achieve its goals, PolyPhy
requires a robust infrastructure that is engineered using DevOps, Code Refactoring, and
Continuous Integration/Continuous Deployment (CI/CD) practices.
You can see an instructive overview of PolyPhy in our workshop and more details about our research &lt;a href="https://polyphy.io/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p></description></item><item><title>Strengthening Underserved Segments of the Open Source Pipeline</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/sus/20230524-nandinisaagar/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/sus/20230524-nandinisaagar/</guid><description>&lt;p>Namaste everyone🙏🏻!&lt;/p>
&lt;p>I&amp;rsquo;m Nandini Saagar from Mumbai, an undergraduate student at the Indian Institute of Technology (BHU), Varanasi. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/sus">Strengthening Underserved Segments of the Open Source Pipeline&lt;/a> project, my &lt;a href="https://docs.google.com/document/d/1snzaUfBvptLcWP7I8IyKYFuBNfVGxNe9mnYkFXhb5ZM/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/emily-lovell/">Emily Lovell&lt;/a>, aims to strengthen underserved segments of the open source pipeline.&lt;/p>
&lt;p>My interest in Open Source was first piqued as a freshman, when I was introduced to it as a place where people from all communities and backgrounds come together to create software with real-world impact, in a completely autonomous and self-governed manner! I am so glad that I could go from someone who imagined Open Source as a distant dream to being part of multiple such communities. This journey has been life-defining for me, and that’s why I want to help deliver the message of Open Source to all teenagers!&lt;/p>
&lt;p>This project seeks to invite and support broader, more diverse participation in open source by supporting early contributors, especially those who have been historically minoritized within tech. It aims to create content that anyone with some Open Source experience can use to guide new students through Open Source, GitHub, and related technologies; to provide a platform where contributors can share their Open Source experiences and testimonials; to conduct an Open Source-themed hackathon/scavenger hunt; and to leverage social media engagement to acquaint young, brilliant minds with the technical and open source world at an early age.&lt;/p>
&lt;p>Stay tuned to explore the enormous world of Open Source with me!&lt;/p></description></item></channel></rss>