scalability | UCSC OSPO

[Final] ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems

Wed, 18 Sep 2024 00:00:00 +0000

Hello everyone,

In my SoR 2024 project, ScaleRep project for SoR 2024 under the mentorship of Bogdan "Bo" Stoica and Yang Wang. I’m excited to share the final progress and insights we’ve gathered on tackling scalability bugs in large-scale distributed systems. I aimed to tackle the reproducibility challenges posed by scalability bugs in large-scale distributed systems. Below is a detailed summary of the investigations and findings we’ve conducted on scalability bugs in large-scale distributed systems.

Project Overview

As you may recall, our project, ScaleRep, aimed to tackle the challenge of scalability bugs—those insidious issues that often arise in large-scale distributed systems under heavy workloads. These bugs, when triggered, can lead to significant system issues such as downtime, performance bottlenecks, and even data loss. They are particularly difficult to catch using traditional testing methods.

Our primary focus was on reproducing these bugs, documenting the challenges involved, and providing insights into how these bugs manifest under various conditions. This documentation will help researchers identify, benchmark, and resolve similar issues in the future.

Progress

Since the midterm update, several Apache Ignite bugs have been investigated, some of which have been successfully reproduced and uploaded to Trovi for the research community to access and reuse. Below is the progress on the bugs investigated:

Bugs Investigated

Key Insights & Challenges

Complexity of Scalability Bugs Many scalability bugs involve subtle and complex interactions that are not easily detected in standard testing environments. For instance, IGNITE-20602 only manifested under certain high-load conditions and required a specific workload and environment to reliably trigger the issue. This highlights the importance of large-scale testing when investigating scalability issues.
Dependency and Documentation Gaps We encountered significant challenges with outdated dependencies and incomplete documentation, particularly in older bugs like IGNITE-16072. In these cases, reproducing the bug required extensive modifications or wasn’t feasible without investing disproportionate effort in updating dependencies.
Effectiveness of Trovi and Chameleon Packaging and sharing our reproducible investigations through Trovi and Chameleon have proven highly effective. By providing researchers with pre-configured environments and detailed documentation, we’ve laid the groundwork for future collaboration and further research on these bugs. We expect this to greatly benefit others attempting to reproduce similar issues.
Impact of Speed-Based Throttling Our investigation into IGNITE-16600 revealed several important insights into speed-based throttling and its impact on system performance under high-load conditions. By analyzing the checkpoint starvation and thread throttling mechanisms, we were able to identify areas for improvement in the latest Ignite releases.

Next Steps

Expanding Collaboration: The packaged bugs and replayable Trovi experiments will be made available to the broader research community, encouraging further investigation and enhancements to large-scale distributed systems.

The ScaleRep project has been an exciting journey into the world of scalability bugs, pushing the boundaries of what’s possible in terms of reproducibility and benchmarking. Through this project, we’ve demonstrated the importance of rigorous testing and comprehensive documentation in improving the reliability of distributed systems.

[MidTerm] ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems

Thu, 01 Aug 2024 00:00:00 +0000

Hey there, scalability enthusiasts and fellow researchers! I’m excited to share my progress on the ScaleRep project for SoR 2024 under the mentorship of Bogdan "Bo" Stoica and Yang Wang. Here’s a glimpse into how we’re tackling scalability bugs in large-scale distributed systems.

Project Overview

Large-scale distributed systems are the backbone of modern computing, powering various applications and services. However, these systems often face challenges related to reliability and performance, particularly scalability bugs. These bugs manifest in large-scale deployments, causing issues such as system downtime, reduced responsiveness, and data loss. Traditional bug-finding methods fall short in detecting these bugs, which are triggered by factors like component count, system load, workload size, recovery protocol reliability, and intermediate failure magnitude.

Our project, ScaleRep, aims to address these challenges by analyzing recent scalability issues from ten popular open-source large-scale systems. We are providing detailed accounts of bug reproduction experiences, identifying common challenges, and developing protocols for triggering and quantifying the impact of scalability bugs.

Progress Highlights

So far, I have been working on the following bugs and have successfully uploaded some of them to Trovi. Here’s a brief overview of my progress:

Bugs Worked On:

IGNITE-20614: Uploaded to Trovi Trovi Link
IGNITE-17407: Uploaded to Trovi Trovi Link
IGNITE-20692
IGNITE-16600
IGNITE-16072

What is Chameleon and Trovi?

Chameleon is a configurable experimental environment for large-scale cloud research. It provides a platform for running and testing distributed systems at scale, allowing researchers to reproduce and study scalability issues in a controlled setting.

Trovi is a platform that facilitates the sharing of reproducible artifacts. By uploading our bug reproduction artifacts to Trovi, we enable other researchers to easily reproduce scalability bugs, fostering collaboration and advancing the field of distributed systems research.

Short Description of the Bugs

IGNITE-20614 This bug refers to an issue where the Ignite service grid experiences degradation or hangs under specific conditions related to service deployment and node restarts.

Root Causes: The root cause is a race condition during the deployment and undeployment of services in the service grid, particularly when nodes are restarted or when there is a significant amount of concurrent service deployment and undeployment activity.

Impact: The impact of this bug includes potential service grid hangs, degraded performance, and possible inability to deploy or undeploy services as expected, which can disrupt the overall operation of the Ignite cluster.

Fix: The fix involves adding proper synchronization mechanisms to handle concurrent service deployment and undeployment operations more gracefully, ensuring that race conditions are avoided.

IGNITE-17407 This issue pertains to the incorrect behavior of the Ignite thin client protocol, particularly when dealing with binary objects and schema changes.

Root Causes: The root cause lies in the way the thin client handles binary object schema changes. The thin client was not correctly updating the schema cache, leading to inconsistencies and incorrect behavior when deserializing binary objects.

Impact: Users of the thin client may experience issues with binary object deserialization, leading to potential data corruption, incorrect query results, and overall application instability.

Fix: The fix involves updating the thin client protocol to properly handle schema changes by ensuring that the schema cache is correctly updated and synchronized with the server.

IGNITE-20692 This bug is related to the performance degradation observed in the Ignite SQL engine when executing certain complex queries.

Root Causes: The root cause is identified as inefficient query planning and execution strategies for specific types of complex SQL queries, leading to excessive resource consumption and slow query performance.

Impact: Users running complex SQL queries may experience significant performance degradation, leading to slower response times, increased CPU and memory usage, and potentially impacting the overall performance of the Ignite cluster.

Fix: The fix involves optimizing the SQL query planner and executor to handle complex queries more efficiently, including better indexing strategies, improved query plan caching, and more effective resource management during query execution.

IGNITE-16600 This bug involves an issue with speed-based throttling in the checkpoint process, leading to possible starvation of the checkpoint thread under heavy load.

Root Causes: The root cause is the absence of proper mechanisms to wake up throttled threads when they no longer need to be throttled, resulting in unnecessary waiting and potential starvation of the checkpoint thread.

Impact: Under heavy load, the checkpoint process can be significantly delayed, leading to slower checkpoint completion times, increased risk of data loss, and overall degraded performance of the Ignite cluster.

Fix: The fix includes implementing methods to wake up throttled threads when they no longer need to be throttled (tryWakeupThrottledThreads and shouldThrottle), ensuring that the checkpoint process can proceed without unnecessary delays.

IGNITE-16072 This issue pertains to the incorrect handling of SQL queries involving NULL values in the Ignite SQL engine, leading to unexpected query results.

Root Causes: The root cause is an incorrect implementation of SQL semantics for handling NULL values in certain query conditions, particularly in the presence of complex joins and subqueries.

Impact: Users may experience incorrect query results when NULL values are involved, leading to potential data inconsistencies and incorrect application behavior.

Fix: The fix involves correcting the SQL engine’s implementation to properly handle NULL values according to the SQL standard, ensuring that queries involving NULL values produce the expected results.

What’s Next?

Continued Bug Reproduction:

Focus on reproducing more scalability bugs

Documentation of Challenges:

Breakdown specific challenges encountered during attempts to reproduce scalability bugs.
Categorize challenges, including technical complexities, environmental dependencies, and lack of documentation in bug reports.

Finalizing Project Deliverables:

Package artifacts using Jupyter notebook scripts for convenient replay of investigation steps.
Upload the package to Trovi for replayable artifacts, enabling other researchers to easily reproduce scalability bugs for our benchmark applications.

Conclusion

The ScaleRep project has made significant strides in reproducing and benchmarking scalability bugs in large-scale distributed systems. By successfully reproducing and documenting scalability bugs, we are contributing valuable insights to the research community, aiding in the development of more robust distributed systems. The protocols and methodologies devised in this project will serve as valuable tools for researchers exploring similar issues.

Stay tuned for more updates as we continue to tackle scalability bugs and improve the reliability and performance of large-scale distributed systems.

ScaleBugs: Reproducible Scalability Bugs

Tue, 07 Feb 2023 00:00:00 +0000

Scalable systems lay essential foundations of the modern information industry. HPC data centers tend to have hundreds to thousands of nodes in their clusters. The use of “extreme-scale” distributed systems has given birth to a new type of bug: scalability bugs. As its name suggests, scalability bugs may be presented depending on the scale of a run, and thus, symptoms may only be observable in large-scale deployments, but not in small or median deployments. For example, Cassandra-6127 is a scalability bug detected in the popular distributed database Cassandra. The scalability bug causes unnecessary CPU usage, however, the symptom is not observed unless ~1000 nodes are deployed. This demonstrates the main challenge of studying scalability bugs: it is extremely challenging to reproduce without deploying the system at a large scale.

In this project, our goal is to build a dataset of reproducible scalability bugs. To achieve this, we will go through the existing bug reports for popular distributed systems, which include Cassandra, HDFS, Ignite, and Kafka. For each bug report, we determine if the reported bug depends on the scale of the run, such as the number of nodes utilized. With the collected scale-dependent bugs, we then will craft the workload to reproduce those scalability bugs. Our workloads will be designed to trigger some functionalities of the system under different configurations (e.g., different numbers of nodes), for which we will observe the impact on performance. For example, a successful reproduction should be able to show the performance drop along with an increasing number of nodes.

Building a Dataset of Reproducible Scalability Bugs

Topics: Scalability systems, bug patterns, reproducibility, bug dataset
Skills: Linux Shell, Docker, Java, Python
Difficulty: Medium
Size: Large (350 hours)
Mentors: Cindy Rubio González, Haryadi S. Gunawi, Hao-Nan Zhu
Contributor(s): Goodness Ayinmode, Zahra Nabila Maharani

The student will build a dataset of reproducible scalability bugs. Each bug artifact in the dataset will contain (1) the buggy and fixed versions of the scalability system, (2) a runtime environment that ensures reproducibility, and (3) a workload shell script that could demonstrate the symptoms of the bug under different scales.

Specific Tasks

Work with the mentors to understand the context of the project.
Learn the background of scalability systems.
Inspect the bug reports from Apache JIRA and identify scale-dependent bugs.
Craft shell scripts to trigger the exact scalability bug described by the bug report.
Organize the reproducible scalability bugs and write documentation to build the code and trigger the bug.