<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Zahra Nabila Maharani | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahra-nabila-maharani/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahra-nabila-maharani/index.xml" rel="self" type="application/rss+xml"/><description>Zahra Nabila Maharani</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahra-nabila-maharani/avatar_hucbc3dccb5a04a23d7e8ec1ea831fa197_1093952_270x270_fill_q75_lanczos_center.jpeg</url><title>Zahra Nabila Maharani</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahra-nabila-maharani/</link></image><item><title>[Final] ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240918-imzahra/</link><pubDate>Wed, 18 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240918-imzahra/</guid><description>&lt;p>Hello everyone,&lt;/p>
&lt;p>In my SoR 2024 project, I worked on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/">ScaleRep project&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>, aiming to tackle the reproducibility challenges posed by scalability bugs in large-scale distributed systems. I’m excited to share the final progress and insights we’ve gathered; below is a detailed summary of our investigations and findings.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>As you may recall, our project, ScaleRep, aimed to tackle the challenge of scalability bugs—those insidious issues that often arise in large-scale distributed systems under heavy workloads. These bugs, when triggered, can lead to significant system issues such as downtime, performance bottlenecks, and even data loss. They are particularly difficult to catch using traditional testing methods.&lt;/p>
&lt;p>Our primary focus was on reproducing these bugs, documenting the challenges involved, and providing insights into how these bugs manifest under various conditions. This documentation will help researchers identify, benchmark, and resolve similar issues in the future.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>Since the midterm update, several Apache Ignite bugs have been investigated, some of which have been successfully reproduced and uploaded to Trovi for the research community to access and reuse. Below is the progress on the bugs investigated:&lt;/p>
&lt;h3 id="bugs-investigated">Bugs Investigated&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20614" target="_blank" rel="noopener">IGNITE-20614&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-17407" target="_blank" rel="noopener">IGNITE-17407&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20602" target="_blank" rel="noopener">IGNITE-20602&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16600" target="_blank" rel="noopener">IGNITE-16600&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16072" target="_blank" rel="noopener">IGNITE-16072&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16582" target="_blank" rel="noopener">IGNITE-16582&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16581" target="_blank" rel="noopener">IGNITE-16581&lt;/a>&lt;/strong>&lt;/li>
&lt;/ol>
&lt;h2 id="key-insights--challenges">Key Insights &amp;amp; Challenges&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Complexity of Scalability Bugs&lt;/strong>: Many scalability bugs involve subtle and complex interactions that are not easily detected in standard testing environments. For instance, IGNITE-20602 only manifested under certain high-load conditions and required a specific workload and environment to reliably trigger the issue. This highlights the importance of large-scale testing when investigating scalability issues.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Dependency and Documentation Gaps&lt;/strong>: We encountered significant challenges with outdated dependencies and incomplete documentation, particularly in older bugs like IGNITE-16072. In these cases, reproducing the bug required extensive modifications or wasn’t feasible without investing disproportionate effort in updating dependencies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Effectiveness of Trovi and Chameleon&lt;/strong>: Packaging and sharing our reproducible investigations through Trovi and Chameleon have proven highly effective. By providing researchers with pre-configured environments and detailed documentation, we’ve laid the groundwork for future collaboration and further research on these bugs. We expect this to greatly benefit others attempting to reproduce similar issues.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Impact of Speed-Based Throttling&lt;/strong>: Our investigation into IGNITE-16600 revealed several important insights into speed-based throttling and its impact on system performance under high-load conditions. By analyzing the checkpoint starvation and thread throttling mechanisms, we were able to identify areas for improvement in the latest Ignite releases.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>&lt;strong>Expanding Collaboration&lt;/strong>: The packaged bugs and replayable Trovi experiments will be made available to the broader research community, encouraging further investigation and enhancements to large-scale distributed systems.&lt;/p>
&lt;p>The ScaleRep project has been an exciting journey into the world of scalability bugs, pushing the boundaries of what’s possible in terms of reproducibility and benchmarking. Through this project, we’ve demonstrated the importance of rigorous testing and comprehensive documentation in improving the reliability of distributed systems.&lt;/p></description></item><item><title>[MidTerm] ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240801-imzahra/</link><pubDate>Thu, 01 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240801-imzahra/</guid><description>&lt;p>Hey there, scalability enthusiasts and fellow researchers! I’m excited to share my progress on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/">ScaleRep project&lt;/a> for SoR 2024 under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>. Here’s a glimpse into how we’re tackling scalability bugs in large-scale distributed systems.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>Large-scale distributed systems are the backbone of modern computing, powering various applications and services. However, these systems often face challenges related to reliability and performance, particularly scalability bugs. These bugs manifest in large-scale deployments, causing issues such as system downtime, reduced responsiveness, and data loss. Traditional bug-finding methods fall short in detecting these bugs, which are triggered by factors like component count, system load, workload size, recovery protocol reliability, and intermediate failure magnitude.&lt;/p>
&lt;p>Our project, ScaleRep, aims to address these challenges by analyzing recent scalability issues from ten popular open-source large-scale systems. We are providing detailed accounts of bug reproduction experiences, identifying common challenges, and developing protocols for triggering and quantifying the impact of scalability bugs.&lt;/p>
&lt;h2 id="progress-highlights">Progress Highlights&lt;/h2>
&lt;p>So far, I have been working on the following bugs and have successfully uploaded some of them to Trovi. Here’s a brief overview of my progress:&lt;/p>
&lt;h3 id="bugs-worked-on">Bugs Worked On:&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20614" target="_blank" rel="noopener">IGNITE-20614&lt;/a>&lt;/strong>: Uploaded to Trovi (&lt;a href="https://www.chameleoncloud.org/experiment/share/9f045059-011e-4089-90d4-0f5845ef3c73" target="_blank" rel="noopener">artifact link&lt;/a>)&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-17407" target="_blank" rel="noopener">IGNITE-17407&lt;/a>&lt;/strong>: Uploaded to Trovi (&lt;a href="https://www.chameleoncloud.org/experiment/share/9cfd42b7-c7c9-4b6b-a538-b6c496eb1bed" target="_blank" rel="noopener">artifact link&lt;/a>)&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20692" target="_blank" rel="noopener">IGNITE-20692&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16600" target="_blank" rel="noopener">IGNITE-16600&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16072" target="_blank" rel="noopener">IGNITE-16072&lt;/a>&lt;/strong>&lt;/li>
&lt;/ol>
&lt;h2 id="what-is-chameleon-and-trovi">What is Chameleon and Trovi?&lt;/h2>
&lt;p>&lt;strong>&lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a>&lt;/strong> is a configurable experimental environment for large-scale cloud research. It provides a platform for running and testing distributed systems at scale, allowing researchers to reproduce and study scalability issues in a controlled setting.&lt;/p>
&lt;p>&lt;strong>&lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Trovi&lt;/a>&lt;/strong> is a platform that facilitates the sharing of reproducible artifacts. By uploading our bug reproduction artifacts to Trovi, we enable other researchers to easily reproduce scalability bugs, fostering collaboration and advancing the field of distributed systems research.&lt;/p>
&lt;h2 id="short-description-of-the-bugs">Short Description of the Bugs&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20614" target="_blank" rel="noopener">IGNITE-20614&lt;/a>
This bug refers to an issue where the Ignite service grid experiences degradation or hangs under specific conditions related to service deployment and node restarts.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause is a race condition during the deployment and undeployment of services in the service grid, particularly when nodes are restarted or when there is a significant amount of concurrent service deployment and undeployment activity.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: The impact of this bug includes potential service grid hangs, degraded performance, and possible inability to deploy or undeploy services as expected, which can disrupt the overall operation of the Ignite cluster.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix involves adding proper synchronization mechanisms to handle concurrent service deployment and undeployment operations more gracefully, ensuring that race conditions are avoided.&lt;/p>
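&lt;p>To illustrate the shape of such a fix, here is a minimal Java sketch (not Ignite&amp;rsquo;s actual code; the class and method names are hypothetical) showing how per-service synchronization serializes concurrent deploy and undeploy operations so the race cannot occur:&lt;/p>
&lt;pre>&lt;code class="language-java">import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: serialize deploy/undeploy per service name. */
class ServiceRegistry {
    private final Map&lt;String, Object> locks = new ConcurrentHashMap&lt;>();
    private final Map&lt;String, String> deployed = new ConcurrentHashMap&lt;>();

    void deploy(String name, String config) {
        synchronized (lockFor(name)) {
            // Without this lock, a concurrent undeploy could interleave with
            // deployment and leave the service grid in an inconsistent state.
            deployed.put(name, config);
        }
    }

    void undeploy(String name) {
        synchronized (lockFor(name)) {
            deployed.remove(name);
        }
    }

    private Object lockFor(String name) {
        // One lock object per service, so unrelated services don't contend.
        return locks.computeIfAbsent(name, n -> new Object());
    }
}
&lt;/code>&lt;/pre>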
&lt;ol start="2">
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-17407" target="_blank" rel="noopener">IGNITE-17407&lt;/a>
This issue pertains to the incorrect behavior of the Ignite thin client protocol, particularly when dealing with binary objects and schema changes.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause lies in the way the thin client handles binary object schema changes. The thin client was not correctly updating the schema cache, leading to inconsistencies and incorrect behavior when deserializing binary objects.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: Users of the thin client may experience issues with binary object deserialization, leading to potential data corruption, incorrect query results, and overall application instability.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix involves updating the thin client protocol to properly handle schema changes by ensuring that the schema cache is correctly updated and synchronized with the server.&lt;/p>
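&lt;p>A minimal sketch of the idea, under our own assumptions (the types and methods below are hypothetical, not the real thin-client protocol): by keying the cache on both the type and the schema version, a schema change becomes a cache miss that triggers a fresh fetch, rather than a stale hit that corrupts deserialization:&lt;/p>
&lt;pre>&lt;code class="language-java">import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: client-side schema cache keyed by (typeId, schemaId). */
class SchemaCache {
    record Key(int typeId, int schemaId) {}

    private final Map&lt;Key, String[]> fieldOrder = new ConcurrentHashMap&lt;>();

    /** Return the cached field layout, fetching from the server on a miss. */
    String[] resolve(int typeId, int schemaId, SchemaFetcher server) {
        // Keying by schemaId (not just typeId) means a changed schema shows
        // up as a cache miss instead of a stale entry.
        return fieldOrder.computeIfAbsent(new Key(typeId, schemaId),
                k -> server.fetchSchema(k.typeId(), k.schemaId()));
    }

    interface SchemaFetcher {
        String[] fetchSchema(int typeId, int schemaId);
    }
}
&lt;/code>&lt;/pre>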
&lt;ol start="3">
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20692" target="_blank" rel="noopener">IGNITE-20692&lt;/a>
This bug is related to the performance degradation observed in the Ignite SQL engine when executing certain complex queries.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause is identified as inefficient query planning and execution strategies for specific types of complex SQL queries, leading to excessive resource consumption and slow query performance.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: Users running complex SQL queries may experience significant performance degradation, leading to slower response times, increased CPU and memory usage, and potentially impacting the overall performance of the Ignite cluster.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix involves optimizing the SQL query planner and executor to handle complex queries more efficiently, including better indexing strategies, improved query plan caching, and more effective resource management during query execution.&lt;/p>
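&lt;p>One of the techniques mentioned, query plan caching, can be sketched in a few lines of Java (a toy illustration under our own assumptions, not Ignite&amp;rsquo;s SQL engine): paying the planning cost once per distinct statement instead of on every execution:&lt;/p>
&lt;pre>&lt;code class="language-java">import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: a bounded LRU cache of prepared query plans. */
class PlanCache {
    private static final int MAX_PLANS = 1024;

    private final Map&lt;String, QueryPlan> plans =
        new LinkedHashMap&lt;>(16, 0.75f, true) { // access-order = LRU
            @Override
            protected boolean removeEldestEntry(Map.Entry&lt;String, QueryPlan> e) {
                return size() > MAX_PLANS; // evict the least-recently-used plan
            }
        };

    synchronized QueryPlan planFor(String sql, Planner planner) {
        // Re-planning a complex query on every execution is repeated work
        // that hurts under load; caching pays the cost once per statement.
        return plans.computeIfAbsent(sql, planner::plan);
    }

    interface Planner { QueryPlan plan(String sql); }
    interface QueryPlan {}
}
&lt;/code>&lt;/pre>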
&lt;ol start="4">
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16600" target="_blank" rel="noopener">IGNITE-16600&lt;/a>
This bug involves an issue with speed-based throttling in the checkpoint process, leading to possible starvation of the checkpoint thread under heavy load.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause is the absence of proper mechanisms to wake up throttled threads when they no longer need to be throttled, resulting in unnecessary waiting and potential starvation of the checkpoint thread.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: Under heavy load, the checkpoint process can be significantly delayed, leading to slower checkpoint completion times, increased risk of data loss, and overall degraded performance of the Ignite cluster.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix includes implementing methods to wake up throttled threads when they no longer need to be throttled (tryWakeupThrottledThreads and shouldThrottle), ensuring that the checkpoint process can proceed without unnecessary delays.&lt;/p>
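&lt;p>The two method names above come from the fix; everything else in the following sketch is our simplified illustration of the mechanism, not Ignite&amp;rsquo;s implementation. Writers park themselves while throttled, and the checkpointer explicitly unparks them once throttling is no longer needed, which is exactly the wake-up that was missing:&lt;/p>
&lt;pre>&lt;code class="language-java">import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.LockSupport;

/** Hypothetical sketch of speed-based throttling with explicit wake-up. */
class WriteThrottleSketch {
    private final Queue&lt;Thread> throttled = new ConcurrentLinkedQueue&lt;>();
    private volatile long markDirtySpeed;  // observed page-dirtying speed
    private volatile long checkpointSpeed; // speed the checkpoint can sustain

    boolean shouldThrottle() {
        return markDirtySpeed > checkpointSpeed;
    }

    /** Writer threads call this before dirtying a page. */
    void throttleIfNeeded() {
        while (shouldThrottle()) {
            throttled.add(Thread.currentThread());
            // Re-check after registering so a concurrent wake-up isn't lost;
            // an early unpark just leaves a permit and park() returns at once.
            if (shouldThrottle())
                LockSupport.park(this);
            throttled.remove(Thread.currentThread());
        }
    }

    /** Checkpoint progress updates the estimates and wakes waiting writers. */
    void updateSpeeds(long dirtySpeed, long cpSpeed) {
        markDirtySpeed = dirtySpeed;
        checkpointSpeed = cpSpeed;
        tryWakeupThrottledThreads();
    }

    void tryWakeupThrottledThreads() {
        if (!shouldThrottle()) {
            Thread t;
            while ((t = throttled.poll()) != null)
                LockSupport.unpark(t);
        }
    }
}
&lt;/code>&lt;/pre>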
&lt;ol start="5">
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16072" target="_blank" rel="noopener">IGNITE-16072&lt;/a>
This issue pertains to the incorrect handling of SQL queries involving NULL values in the Ignite SQL engine, leading to unexpected query results.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause is an incorrect implementation of SQL semantics for handling NULL values in certain query conditions, particularly in the presence of complex joins and subqueries.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: Users may experience incorrect query results when NULL values are involved, leading to potential data inconsistencies and incorrect application behavior.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix involves correcting the SQL engine&amp;rsquo;s implementation to properly handle NULL values according to the SQL standard, ensuring that queries involving NULL values produce the expected results.&lt;/p>
&lt;h2 id="whats-next">What&amp;rsquo;s Next?&lt;/h2>
&lt;h4 id="continued-bug-reproduction">Continued Bug Reproduction:&lt;/h4>
&lt;ul>
&lt;li>Focus on reproducing more scalability bugs&lt;/li>
&lt;/ul>
&lt;h4 id="documentation-of-challenges">Documentation of Challenges:&lt;/h4>
&lt;ul>
&lt;li>Break down the specific challenges encountered during attempts to reproduce scalability bugs.&lt;/li>
&lt;li>Categorize challenges, including technical complexities, environmental dependencies, and lack of documentation in bug reports.&lt;/li>
&lt;/ul>
&lt;h4 id="finalizing-project-deliverables">Finalizing Project Deliverables:&lt;/h4>
&lt;ul>
&lt;li>Package artifacts using Jupyter notebook scripts for convenient replay of investigation steps.&lt;/li>
&lt;li>Upload the package to Trovi for replayable artifacts, enabling other researchers to easily reproduce scalability bugs for our benchmark applications.&lt;/li>
&lt;/ul>
&lt;h3 id="conclusion">Conclusion&lt;/h3>
&lt;p>The ScaleRep project has made significant strides in reproducing and benchmarking scalability bugs in large-scale distributed systems. By successfully reproducing and documenting scalability bugs, we are contributing valuable insights to the research community, aiding in the development of more robust distributed systems. The protocols and methodologies devised in this project will serve as valuable tools for researchers exploring similar issues.&lt;/p>
&lt;p>Stay tuned for more updates as we continue to tackle scalability bugs and improve the reliability and performance of large-scale distributed systems.&lt;/p></description></item><item><title>ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240608-imzahra/</link><pubDate>Sat, 08 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240608-imzahra/</guid><description>&lt;p>Hi! I&amp;rsquo;m Zahra, an undergraduate at Universitas Dian Nuswantoro, Indonesia.
As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/">ScaleRep&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1Jfk7lRNIWfhFLkVHHN_ZkNQiTg4f8-xp/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>, aims to systematically understand, characterize, and document the challenges associated with scalability bugs in large-scale distributed systems.&lt;/p>
&lt;p>ScaleRep proposes a two-fold strategy to address scalability bugs in large-scale distributed systems. First, Bug Analysis and Documentation involves studying recent scalability issues across popular open-source systems such as Cassandra, Hadoop, HDFS, Ignite, and Spark to understand bug causes, symptoms, and solutions. This includes pinpointing common challenges hindering bug reproduction and devising protocols to trigger and measure scalability bug impacts. Second, Implementation and Artifact Packaging focuses on identifying, reproducing, and documenting scalability bugs, then packaging artifacts with &lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Chameleon Trovi&lt;/a>. This method emphasizes precise bug analysis, establishing reproducible environments, and detailed documentation to ensure artifact reliability and usability.&lt;/p></description></item><item><title>ScaleBugs: Reproducible Scalability Bugs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230802-boluwarinayinmode/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230802-boluwarinayinmode/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As part of the ScaleBugs project, we have worked on building a dataset of reproducible scalability bugs. To achieve this, we go through existing bug reports for popular distributed systems, including Cassandra, HDFS, Ignite, and Kafka. Workloads are designed to reproduce these scalability bugs by triggering certain functionalities of the system under different configurations (e.g., different numbers of nodes), and we observe the impact on performance under each configuration.&lt;/p>
&lt;p>So far we have worked on packaging, inside Docker containers, the buggy and fixed versions of each system, a runtime environment that ensures reproducibility, and the workloads used to trigger the symptoms of the bug. By packaging these versions together, we are simplifying the process of deployment and testing. This enables us to switch between different versions efficiently, aiding in the identification and comparison of the bug&amp;rsquo;s behavior. For each scalability system, we have carefully built a runtime environment that is consistent and reproducible. This approach ensures that each time we run tests or investigations, the conditions remain identical.&lt;/p>
&lt;h2 id="new-terms">New Terms&lt;/h2>
&lt;p>In order to make sense of the various bug reports, we had to learn some terminology associated with scalability systems:&lt;/p>
&lt;p>&lt;strong>Clusters&lt;/strong>: In distributed systems, a cluster is a group of connected machines (nodes) that work together and can be treated as a single system. The term also appears in other fields, such as data analysis, where clusters are groups of data points with similar characteristics, making it easier to spot patterns or trends in the data.&lt;/p>
&lt;p>&lt;strong>Cluster Membership&lt;/strong>: Cluster membership refers to the process of determining which entities belong to a particular cluster, based on criteria such as similarity in attributes, spatial proximity, or shared characteristics. In a distributed system, it means tracking which nodes are currently part of the cluster.&lt;/p>
&lt;p>&lt;strong>Locks&lt;/strong>: In computer programming, locks are mechanisms used to manage access to shared resources, such as files, data structures, or hardware devices. When multiple processes or threads need to access a shared resource simultaneously, locks ensure that only one process or thread can access it at a time, preventing data corruption or conflicts.&lt;/p>
&lt;p>&lt;strong>Lock Contentions&lt;/strong>: Lock contention occurs when multiple processes or threads attempt to acquire the same lock simultaneously. When this happens, one process or thread must wait until the lock becomes available, leading to potential delays and reduced performance.&lt;/p>
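&lt;p>A toy Java example of contention (our own illustration): eight threads increment one counter under a single lock, so every thread queues at lock() and the work is effectively serialized:&lt;/p>
&lt;pre>&lt;code class="language-java">import java.util.concurrent.locks.ReentrantLock;

/** Toy illustration of lock contention: 8 threads serialize on one lock. */
public class ContentionDemo {
    private static final ReentrantLock LOCK = new ReentrantLock();
    private static long counter = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[8];
        for (int i = 0; i &lt; workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j &lt; 1_000_000; j++) {
                    LOCK.lock();       // every thread queues here
                    try {
                        counter++;     // the protected critical section
                    } finally {
                        LOCK.unlock();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
        // The lock serializes the increments, so adding threads barely
        // improves wall-clock time: that waiting is lock contention.
        System.out.println("counter = " + counter);
    }
}
&lt;/code>&lt;/pre>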
&lt;p>&lt;strong>Critical Paths&lt;/strong>: In project management or process analysis, a critical path is the longest chain of dependent tasks that determines the overall duration of the project or process. Any delay in tasks along the critical path will directly impact the project&amp;rsquo;s completion time. Analogously, in a distributed system the critical path is the chain of operations that determines a request&amp;rsquo;s end-to-end latency, so any slowdown there directly delays the request.&lt;/p>
&lt;p>&lt;strong>Tokens&lt;/strong>: Tokens can have various meanings depending on the context. In computer programming, tokens are the smallest units of source code recognized by a compiler or interpreter. In cryptography, tokens can represent digital certificates or authentication data used for secure communication.&lt;/p>
&lt;p>&lt;strong>Nodes&lt;/strong>: In the context of network theory or graph theory, nodes are individual points or entities that form a network or graph. In a computer network, nodes can be devices like computers or routers, and in a social network, nodes can represent individuals or entities.&lt;/p>
&lt;p>&lt;strong>Peers&lt;/strong>: Peers are entities within a network that have the same status or capabilities. In peer-to-peer networks, each node can act as both a client and a server, enabling direct communication between nodes without relying on a central server.&lt;/p>
&lt;p>&lt;strong>Gossipers, Gossip Protocol&lt;/strong>: In distributed systems, gossipers are nodes that share information with each other using the gossip protocol. The gossip protocol involves randomly selecting peers and exchanging information in a decentralized manner, allowing information to spread quickly across the network.&lt;/p>
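&lt;p>To make the gossip protocol concrete, here is a toy Java sketch of one push-gossip round (our own simplification): a node bumps its heartbeat, picks a random peer, and the peer keeps the freshest heartbeat it has seen for every node:&lt;/p>
&lt;pre>&lt;code class="language-java">import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Toy sketch of one round of push gossip. */
class GossipNode {
    final String id;
    final Map&lt;String, Long> heartbeats = new HashMap&lt;>(); // nodeId -> heartbeat

    GossipNode(String id) {
        this.id = id;
        heartbeats.put(id, 0L);
    }

    void gossipOnce(List&lt;GossipNode> peers) {
        heartbeats.merge(id, 1L, Long::sum); // bump our own heartbeat
        // Pick a random peer (this toy version may even pick itself).
        GossipNode peer = peers.get(ThreadLocalRandom.current().nextInt(peers.size()));
        peer.receive(heartbeats);
    }

    void receive(Map&lt;String, Long> remote) {
        // Keep the freshest heartbeat seen for every node in the cluster.
        remote.forEach((node, hb) -> heartbeats.merge(node, hb, Math::max));
    }
}
&lt;/code>&lt;/pre>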
&lt;p>&lt;strong>Threads&lt;/strong>: Threads are the smallest units of execution within a process in computer programming. Multiple threads can run concurrently within a single process, enabling multitasking and parallel processing. Threads can share the same resources within the process, making them more lightweight than separate processes. However, proper synchronization is essential to prevent data corruption or conflicts when multiple threads access shared resources.&lt;/p>
&lt;p>&lt;strong>Flush and Writes Contention&lt;/strong>: This refers to a situation where simultaneous operations involving data flushing (saving data to a storage medium) and data writing (updating or adding data) are causing conflicts or delays. This contention can arise when multiple processes or threads attempt to perform these operations concurrently, leading to performance bottlenecks or potential data integrity issues.&lt;/p>
&lt;h2 id="accomplishments">Accomplishments&lt;/h2>
&lt;p>We have been able to build docker containers for the following scalability bugs:&lt;/p>
&lt;p>&lt;strong>IGNITE 12087&lt;/strong>&lt;/p>
&lt;p>This bug stems from the resolution of the IGNITE-5227 issue (another bug), which led to a significant decline in the performance of a particular operation. Before IGNITE-5227 was addressed, inserting 30,000 entries was remarkably efficient, completing in roughly 1 second. After the resolution, the same insertion of 30,000 entries slowed to approximately 130 seconds, a slowdown of more than 100 times.&lt;/p>
&lt;p>&lt;strong>CASSANDRA 14660&lt;/strong>&lt;/p>
&lt;p>This bug is related to how clusters work together and how a lock is causing conflicts with the critical path. The issue arises from a method call that uses O(Peers * Tokens) resources while contending for a lock, which is causing problems in the write path. The lock is used to protect cached tokens that are essential for determining the correct replicas. The lock is implemented as a synchronized block in the TokenMetadata class.&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>It was fixed by reducing the complexity of the operation to O(Peers), taking advantage of properties of the token list and the underlying data structure.&lt;/p>
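&lt;p>As a toy model of that shape of optimization (not the actual Cassandra change; the token-counting task below is a made-up stand-in, and it assumes each peer&amp;rsquo;s token appears in the sorted list): precompute one lookup table over the token list once, then answer each peer in O(1), turning O(Peers * Tokens) into O(Tokens + Peers):&lt;/p>
&lt;pre>&lt;code class="language-java">import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model: replace an O(Peers * Tokens) scan with a one-off table. */
class TokenLookup {
    // Before: for every peer, scan the entire token list.
    static Map&lt;String, Integer> perPeerScan(List&lt;Long> tokens,
                                            Map&lt;String, Long> peerToken) {
        Map&lt;String, Integer> rankOf = new HashMap&lt;>();
        for (var peer : peerToken.entrySet()) {       // O(Peers) ...
            int rank = 0;
            for (long t : tokens)                     // ... times O(Tokens)
                if (t &lt;= peer.getValue()) rank++;
            rankOf.put(peer.getKey(), rank);
        }
        return rankOf;
    }

    // After: build the token-to-rank table once, then O(1) per peer.
    static Map&lt;String, Integer> withPrecomputedRanks(List&lt;Long> sortedTokens,
                                                     Map&lt;String, Long> peerToken) {
        Map&lt;Long, Integer> rank = new HashMap&lt;>();
        for (int i = 0; i &lt; sortedTokens.size(); i++) // O(Tokens), done once
            rank.put(sortedTokens.get(i), i + 1);
        Map&lt;String, Integer> rankOf = new HashMap&lt;>();
        for (var peer : peerToken.entrySet())         // O(Peers)
            rankOf.put(peer.getKey(), rank.getOrDefault(peer.getValue(), 0));
        return rankOf;
    }
}
&lt;/code>&lt;/pre>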
&lt;p>&lt;strong>CASSANDRA 12281&lt;/strong>&lt;/p>
&lt;p>This bug is also related to how clusters work together and a lock conflict. The issue arises when a specific method accesses a lot of resources (O(Tokens^2)) while contending for a read lock. As reported, a cluster with around 300 nodes has around 300 * 256 tokens (assuming the default of 256 tokens per node), so joining a new member reportedly takes more than 30 minutes. Because of the long execution time, the lock delays every gossip message, so the joining node never becomes active.&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>The granularity of the lock was reduced: the expensive function calls no longer take the problematic read lock and instead use a synchronized block on a specific field, which does the job much better.&lt;/p>
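&lt;p>A minimal sketch of that granularity change (our own illustration, not Cassandra&amp;rsquo;s TokenMetadata): the expensive call stops holding the ring-wide read lock, which would stall writers such as gossip updates, and synchronizes only on the field it actually touches:&lt;/p>
&lt;pre>&lt;code class="language-java">import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch of reducing lock granularity. */
class RingMetadata {
    private final ReentrantReadWriteLock ringLock = new ReentrantReadWriteLock();
    private final Object cacheLock = new Object(); // narrow, field-level lock
    private List&lt;Long> cachedTokens = List.of();

    // Before: the expensive call ran under the ring-wide read lock. Readers
    // don't block each other, but a long-held read lock blocks every writer,
    // e.g. the gossip updates that need the write lock.
    List&lt;Long> expensiveUnderRingLock() {
        ringLock.readLock().lock();
        try {
            return recomputeTokens();
        } finally {
            ringLock.readLock().unlock();
        }
    }

    // After: synchronize only on the cached field the call actually needs,
    // leaving the ring lock free for membership changes.
    List&lt;Long> expensiveUnderFieldLock() {
        synchronized (cacheLock) {
            if (cachedTokens.isEmpty())
                cachedTokens = recomputeTokens();
            return cachedTokens;
        }
    }

    private List&lt;Long> recomputeTokens() {
        return List.of(1L, 2L, 3L); // stand-in for the O(Tokens^2) work
    }
}
&lt;/code>&lt;/pre>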
&lt;p>&lt;strong>HA16850&lt;/strong>&lt;/p>
&lt;p>This is a bug related to obtaining thread information in the JvmMetrics package. The original buggy version used MXBeans to obtain thread information. That call uses an underlying native implementation that holds a lock on threads, preventing thread termination or creation. This means that the more threads we have to obtain information for, the longer the function call holds the lock. As a result, the execution time scales with the number of active threads, O(threads).&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>Developers switched to a ThreadGroup to obtain metrics for threads. As a result, no lock is held while inspecting each thread.&lt;/p>
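&lt;p>The contrast can be sketched with standard JDK APIs (our simplification of the idea, not Hadoop&amp;rsquo;s JvmMetrics code): counting runnable threads via ThreadMXBean, whose native implementation can hold a thread lock, versus enumerating the root ThreadGroup:&lt;/p>
&lt;pre>&lt;code class="language-java">import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Sketch: two ways to count RUNNABLE threads. */
class ThreadCounting {
    // Buggy-style approach: the MXBean call is backed by a native
    // implementation whose cost (and lock hold time) grows with the
    // number of threads being inspected.
    static int viaMxBean() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int runnable = 0;
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null &amp;&amp; info.getThreadState() == Thread.State.RUNNABLE)
                runnable++;
        }
        return runnable;
    }

    // Fixed-style approach: walk up to the root ThreadGroup and enumerate,
    // avoiding a per-thread lock while gathering the metric.
    static int viaThreadGroup() {
        ThreadGroup root = Thread.currentThread().getThreadGroup();
        while (root.getParent() != null)
            root = root.getParent();
        Thread[] threads = new Thread[root.activeCount() * 2]; // headroom
        int n = root.enumerate(threads, true);
        int runnable = 0;
        for (int i = 0; i &lt; n; i++)
            if (threads[i].getState() == Thread.State.RUNNABLE)
                runnable++;
        return runnable;
    }
}
&lt;/code>&lt;/pre>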
&lt;p>&lt;strong>CA13923&lt;/strong>&lt;/p>
&lt;p>This issue revolves around conflicts between the &amp;ldquo;flush&amp;rdquo; and &amp;ldquo;writes&amp;rdquo; processes. The main problem is that during the &amp;ldquo;flush&amp;rdquo; process, a resource-intensive function called &amp;ldquo;getAddressRanges&amp;rdquo; is invoked. This function has a high computational cost and its complexity is O(Tokens^2). In other words, the time it takes to complete this function grows quickly as the number of &amp;ldquo;tokens&amp;rdquo; increases. This situation is causing challenges and delays in the overall process.&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>This function call affected many code paths, and the fix ensures that getAddressRanges is no longer called on critical paths.&lt;/p>
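&lt;p>One way to picture the fix (a hypothetical sketch, not Cassandra&amp;rsquo;s code): recompute the expensive result only when ring membership changes, so the flush path reads a precomputed snapshot instead of invoking the O(Tokens^2) call:&lt;/p>
&lt;pre>&lt;code class="language-java">import java.util.List;

/** Hypothetical sketch: keep an O(Tokens^2) call off the flush path. */
class AddressRangeCache {
    private volatile List&lt;String> cachedRanges = List.of();

    /** Recompute only when ring membership changes, not on every flush. */
    void onRingChange(List&lt;Long> tokens) {
        cachedRanges = computeAddressRanges(tokens); // the expensive call
    }

    /** The flush path reads a precomputed snapshot in O(1). */
    List&lt;String> rangesForFlush() {
        return cachedRanges;
    }

    private List&lt;String> computeAddressRanges(List&lt;Long> tokens) {
        // Stand-in for the real O(Tokens^2) computation.
        return tokens.stream().map(t -> "range-" + t).toList();
    }
}
&lt;/code>&lt;/pre>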
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>&lt;strong>Demanding Memory Requirements&lt;/strong>: Running certain builds consumes a significant amount of memory. This places a strain on system resources and can impact the overall performance and stability of the process.&lt;/p>
&lt;p>&lt;strong>Little Issues Impacting Execution&lt;/strong>: Often, seemingly minor details can obstruct the successful execution of a build. Resolving such issues requires thorough investigation and extensive research into similar problems faced by others in the past.&lt;/p>
&lt;p>&lt;strong>Complexities of Scalability Bugs&lt;/strong>: Identifying the underlying causes of scalability-related bugs is intricate. These bugs exhibit unique characteristics that can complicate the process of pinpointing and comprehending their root origins.&lt;/p>
&lt;h2 id="what-is-docker--for-those-who-dont-know-about-it-">What is Docker? ( For those who don&amp;rsquo;t know about it )&lt;/h2>
&lt;p>Docker is a platform that facilitates the containerization of applications, leading to consistent and efficient deployment across diverse environments. Its benefits include portability, resource efficiency, isolation, and rapid development cycles. DockerHub complements Docker by providing a centralized hub for sharing and accessing container images, fostering collaboration and ease of use within the Docker ecosystem.&lt;/p>
&lt;p>More about Docker: &lt;a href="https://docs.docker.com/get-started/overview/" target="_blank" rel="noopener">https://docs.docker.com/get-started/overview/&lt;/a>&lt;/p></description></item><item><title>ScaleBugs: Reproducible Scalability Bugs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230601-boluwarinayinmode/</link><pubDate>Thu, 01 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230601-boluwarinayinmode/</guid><description>&lt;p>Hello! As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucdavis/scalebugs/">ScaleBugs&lt;/a> project, our proposals (&lt;a href="https://drive.google.com/file/d/17iANa5ei_gguZsGGwR1sfPHOoJysnNsf/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> from &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/goodness-ayinmode/">Goodness Ayinmode&lt;/a> and &lt;a href="https://drive.google.com/file/d/199ZsiWHXsLYbSJ896vaf8tjrYs23P5xN/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> from &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahra-nabila-maharani/">Zahra Nabila Maharani&lt;/a>), under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/cindy-rubio-gonzalez/">Cindy Rubio González&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>, and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/hao-nan-zhu/">Hao-Nan Zhu&lt;/a>, aim to build a dataset of reproducible scalability bugs by analyzing bug reports from popular distributed systems like Cassandra, HDFS, Ignite, and Kafka. For each bug report, we will analyze whether the reported bug is influenced by the scale of the operation, such as the number of nodes being used or the number of requests. The resulting dataset will consist of bug artifacts containing the buggy and fixed versions of the scalability system, a reproducible runtime environment, and workload shell scripts designed to demonstrate bug symptoms under different scales. These resources will help support research and development efforts in addressing scalability issues and optimizing system performance.&lt;/p></description></item></channel></rss>