<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>load testing | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/load-testing/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/load-testing/index.xml" rel="self" type="application/rss+xml"/><description>load testing</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Tue, 12 Nov 2024 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>load testing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/load-testing/</link></image><item><title>Final Report: Deriving Realistic Performance Benchmarks for Python Interpreters</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20241113-mrigankpawagi/</link><pubDate>Tue, 12 Nov 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20241113-mrigankpawagi/</guid><description>&lt;p>Hi, I am Mrigank. As a &lt;em>Summer of Reproducibility 2024&lt;/em> fellow, I have been working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240817-mrigankpawagi/">deriving realistic performance benchmarks for Python interpreters&lt;/a> with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a> from the University of Utah. In particular, we want to benchmark Meta&amp;rsquo;s Static Python interpreter (which is a part of their Cinder project) and compare its performance with CPython on different levels of typing. 
In this post, I will share updates on my work since my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240909-mrigankpawagi/">last update&lt;/a>. This post forms my final report for the &lt;em>Summer of Reproducibility 2024&lt;/em>.&lt;/p>
&lt;h2 id="since-last-time-typing-django-files">Since Last Time: Typing Django Files&lt;/h2>
&lt;p>Based on the profiling results from load testing a Wagtail blog site, I identified three modules in Django that were performance bottlenecks and added shallow types to them. These are available on our GitHub repository.&lt;/p>
&lt;ol>
&lt;li>&lt;a href="https://github.com/utahplt/static-python-perf/blob/main/Benchmark/django/shallow/db/backends/sqlite3/_functions.py" target="_blank" rel="noopener">&lt;code>django.db.backends.sqlite3._functions&lt;/code>&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/utahplt/static-python-perf/blob/main/Benchmark/django/shallow/utils/functional.py" target="_blank" rel="noopener">&lt;code>django.utils.functional&lt;/code>&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/utahplt/static-python-perf/blob/main/Benchmark/django/shallow/views/debug.py" target="_blank" rel="noopener">&lt;code>django.views.debug&lt;/code>&lt;/a>&lt;/li>
&lt;/ol>
&lt;p>I also wrote a &lt;a href="https://github.com/utahplt/static-python-perf/tree/main/Tool_shed/driver" target="_blank" rel="noopener">script&lt;/a> to mix untyped, shallow-typed, and advanced-typed versions of a Python module and create a series of such &lt;em>gradually typed&lt;/em> versions.&lt;/p>
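&lt;p>The idea behind this mixing can be sketched in a few lines of Python. The following is a minimal illustration only, not the actual driver script; the module names and level labels are hypothetical:&lt;/p>

```python
from itertools import product

# Hypothetical modules and the typing levels available for each.
MODULES = ["_functions", "functional", "debug"]
LEVELS = ["untyped", "shallow", "advanced"]

def gradual_configurations(modules, levels):
    """Enumerate every way to assign a typing level to each module.

    Each configuration picks one level per module, producing a lattice
    of gradually typed variants between fully untyped and fully typed.
    """
    for combo in product(levels, repeat=len(modules)):
        yield dict(zip(modules, combo))

configs = list(gradual_configurations(MODULES, LEVELS))
```

&lt;p>With three levels and three modules this yields 27 configurations; benchmarking each one traces a path through the lattice between fully untyped and fully typed code.&lt;/p>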
&lt;h2 id="summary-of-experience-and-contributions">Summary of Experience and Contributions&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>I tried to set up different versions of Zulip to make them work with Static Python. My setup scripts are available in our &lt;a href="https://github.com/utahplt/static-python-perf/tree/main/Benchmark/zulip" target="_blank" rel="noopener">repository&lt;/a>. Unfortunately, Zulip&amp;rsquo;s Zerver did not run with Static Python due to the incompatibility of some Django modules. A few non-Django modules also initially threw errors when run with Static Python due to a &lt;a href="https://github.com/facebookincubator/cinder/issues/137" target="_blank" rel="noopener">bug in Cinder&lt;/a>, but I was able to work around it with a hack (described in the GitHub issue I opened on Cinder&amp;rsquo;s repository).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>I created a &lt;em>Locust version&lt;/em> of the small Django-related benchmarks available in &lt;a href="https://github.com/python/pyperformance" target="_blank" rel="noopener">pyperformance&lt;/a> and &lt;a href="https://github.com/facebookarchive/skybison" target="_blank" rel="noopener">skybison&lt;/a>. This helped me confirm that Django is by itself compatible with Static Python, and helped me get started with Locust. This too is available in our &lt;a href="https://github.com/utahplt/static-python-perf/tree/main/Benchmark/django_sample" target="_blank" rel="noopener">repository&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>As described in the midterm report, I created a complete pipeline with Locust to simulate real-world load on a Wagtail blog site. The instructions and scripts for running these load tests as well as profiling the Django codebase are available (like everything else!) in our &lt;a href="https://github.com/utahplt/static-python-perf/tree/main/Benchmark/wagtail" target="_blank" rel="noopener">repository&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We added shallow types to the three Django modules mentioned above, and I created scripts to mix untyped, shallow-typed, and advanced-typed versions of a Python module to create a series of &lt;em>gradually typed&lt;/em> versions to be tested for performance. We found that advanced-typed code can often be structurally incompatible with shallow-typed code, and we are looking for a solution to this. We are tracking some examples in a &lt;a href="https://github.com/utahplt/static-python-perf/issues/16" target="_blank" rel="noopener">GitHub issue&lt;/a>.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="going-forward">Going Forward&lt;/h2>
&lt;p>I had a great time exploring Static Python, typing in Python, load testing, and all other aspects of this project. I was also fortunate to have a helpful mentor along with other amazing team members in the group. During this project, we hit several roadblocks, such as the challenge of setting up real-world applications with Static Python and the difficulty of adding &lt;em>advanced&lt;/em> types, but we are managing to work around them. I will continue working on this project until we have a complete set of benchmarks and a comprehensive report on the performance of Static Python.&lt;/p>
&lt;p>Our work will continue to be open-sourced and available on our &lt;a href="https://github.com/utahplt/static-python-perf" target="_blank" rel="noopener">GitHub repository&lt;/a> for anyone interested in following along or contributing.&lt;/p></description></item><item><title>Deriving Realistic Performance Benchmarks for Python Interpreters</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240817-mrigankpawagi/</link><pubDate>Sat, 17 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240817-mrigankpawagi/</guid><description>&lt;p>Hi, I am Mrigank. I am one of the &lt;em>Summer of Reproducibility&lt;/em> fellows for 2024, and I will be working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uutah/static-python-perf/">deriving realistic performance benchmarks for Python interpreters&lt;/a> with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a> from the University of Utah.&lt;/p>
&lt;h2 id="background-and-motivation">Background and Motivation&lt;/h2>
&lt;p>Recent work by Meta on a statically typed variant of Python, Static Python, has shown immense promise for moving towards gradually typed languages that do not compromise on performance, at the cost of complete soundness. Lu et al.&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup> provide an evaluation of Static Python and conclude that the performance improvement reported by Meta on their web servers for Instagram is reasonable and is not just the result of refactoring. In fact, the study notes that very little refactoring is typically required to convert existing Python programs to Static Python. However, this study depends on a limited model of the language and does not represent real-world software applications.&lt;/p>
&lt;p>In our project, we aim to create a realistic performance benchmark to reproduce performance improvements reported by Meta and to evaluate the performance of Static Python in real-world software applications. In addition, we will analyze partially-typed code to understand the performance implications of gradual typing in Python.&lt;/p>
&lt;h2 id="key-objectives">Key Objectives&lt;/h2>
&lt;p>We will use widely used open-source applications to derive realistic performance benchmarks for evaluating Static Python. In particular, we will focus on projects that use the Python framework &lt;a href="https://www.djangoproject.com/" target="_blank" rel="noopener">Django&lt;/a>, which is also known to power the backend of Instagram. We plan to begin with &lt;a href="https://github.com/wagtail/wagtail" target="_blank" rel="noopener">Wagtail&lt;/a>, a popular CMS built on Django. We have also identified other potential projects like &lt;a href="https://github.com/zulip/zulip" target="_blank" rel="noopener">Zulip&lt;/a>, &lt;a href="https://github.com/makeplane/plane" target="_blank" rel="noopener">Plane&lt;/a>, and &lt;a href="https://github.com/LibrePhotos/librephotos" target="_blank" rel="noopener">LibrePhotos&lt;/a>. These are all actively maintained projects with large codebases.&lt;/p>
&lt;p>Further, we will analyze the performance of partially-typed code. This will be of value to the Python community as it will provide confidence in gradually moving towards Static Python for improving performance. We will make our benchmarks publicly available for the community to use, reproduce, and extend.&lt;/p>
&lt;h2 id="methodology">Methodology&lt;/h2>
&lt;h3 id="load-testing">Load Testing&lt;/h3>
&lt;p>For each project that we derive benchmarks from, we will design user pipelines that simulate real-world usage and implement them as load tests using the open-source &lt;a href="https://github.com/locustio/locust" target="_blank" rel="noopener">Locust&lt;/a> framework. This will allow us to evaluate the performance of Static Python under real-world loads and scenarios. Locust can spawn thousands of users, each of which independently bombards the system with HTTP requests for the range of tasks defined in its user pipeline. We will host each project on a server (local or cloud) to run these load tests.&lt;/p>
&lt;p>We will profile each project to ensure that our tests cover different parts of the codebase and to identify performance bottlenecks. We can then focus on these bottlenecks while gradually typing the codebase.&lt;/p>
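&lt;p>The profiling step can be approximated with Python&amp;rsquo;s standard library alone. This is only a sketch of the idea: the hypothetical &lt;code>handle_request&lt;/code> stands in for server-side work, whereas in practice the profiler wraps the server process while Locust drives it:&lt;/p>

```python
import cProfile
import io
import pstats

def handle_request():
    """Stand-in for the server-side work triggered by one HTTP request."""
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    handle_request()
profiler.disable()

# Rank functions by cumulative time to surface hot spots.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
```

&lt;p>Sorting by cumulative time surfaces the modules where requests spend most of their time, which become the first candidates for typing.&lt;/p>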
&lt;h3 id="gradual-typing">Gradual Typing&lt;/h3>
&lt;p>For typing the code in these projects, we will create two versions of each project: one with so-called &amp;ldquo;shallow&amp;rdquo; type annotations and another with &amp;ldquo;advanced&amp;rdquo; type annotations. The former is relatively easy to implement, and we can use tools like &lt;a href="https://github.com/Instagram/MonkeyType" target="_blank" rel="noopener">MonkeyType&lt;/a> to generate stubs that can be quickly verified manually. The latter is non-trivial and will require manual effort. We will then mix and match the untyped, shallow, and advanced versions of each project to create different combinations of typed and untyped code. Note that this mixing can be done at the module level as well as at the function or class level.&lt;/p>
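&lt;p>The distinction can be illustrated on a toy function. This is a hypothetical example using plain &lt;code>typing&lt;/code> annotations, which only approximates the shallow/advanced split that Static Python&amp;rsquo;s type system draws:&lt;/p>

```python
from typing import Dict, List

# Untyped original.
def tally(records):
    counts = {}
    for name in records:
        counts[name] = counts.get(name, 0) + 1
    return counts

# "Shallow" version: coarse annotations on the boundary, cheap to add
# (e.g. from MonkeyType stubs) and quick to verify by hand.
def tally_shallow(records: list) -> dict:
    counts: dict = {}
    for name in records:
        counts[name] = counts.get(name, 0) + 1
    return counts

# "Advanced" version: precise parameterized types throughout,
# which typically requires manual effort.
def tally_advanced(records: List[str]) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    for name in records:
        counts[name] = counts.get(name, 0) + 1
    return counts
```

&lt;p>All three versions behave identically; only the precision of the type information differs, which is exactly the variable our benchmarks measure.&lt;/p>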
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This is my first time working on performance-benchmarking and I am excited to pick up new skills in the process. I am also looking forward to interacting with people from the Python community, people from Meta&amp;rsquo;s Static Python team, and also with the maintainers of the projects we will be working on. I will be posting more updates on this project as we make progress. Stay tuned!&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Kuang-Chen Lu, Ben Greenman, Carl Meyer, Dino Viehland, Aniket Panse, and Shriram Krishnamurthi. Gradual Soundness: Lessons from Static Python. &lt;em>The Art, Science, and Engineering of Programming&lt;/em>.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Midterm Report: Deriving Realistic Performance Benchmarks for Python Interpreters</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240909-mrigankpawagi/</link><pubDate>Sat, 17 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240909-mrigankpawagi/</guid><description>&lt;p>Hi, I am Mrigank. As a &lt;em>Summer of Reproducibility 2024&lt;/em> fellow, I am working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240817-mrigankpawagi/">deriving realistic performance benchmarks for Python interpreters&lt;/a> with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a> from the University of Utah. In this post, I will provide an update on the progress we have made so far.&lt;/p>
&lt;h2 id="creating-a-performance-benchmark">Creating a Performance Benchmark&lt;/h2>
&lt;p>We are currently focusing on applications built on top of Django, a widely used Python web framework. For our first benchmark, we chose &lt;a href="https://github.com/wagtail/wagtail" target="_blank" rel="noopener">Wagtail&lt;/a>, a popular content management system. We created a pipeline with Locust to simulate real-world load on the application. All of our work is open-sourced and available on our &lt;a href="https://github.com/utahplt/static-python-perf/blob/main/Benchmark/wagtail/locustfile.py" target="_blank" rel="noopener">GitHub repository&lt;/a>.&lt;/p>
&lt;p>This load-testing pipeline creates hundreds of users who independently create many blog posts on a Wagtail blog site. At the same time, thousands of users are spawned to view these blog posts. Wagtail does not have a built-in API, so it took some initial effort to figure out which endpoints to hit; I did this by inspecting the network logs in the browser while interacting with the Wagtail admin interface.&lt;/p>
&lt;p>A snapshot from a run of the load test with Locust is shown in the featured image above. This snapshot was generated by spawning users from 24 parallel Locust processes. This was done on a local server, and we plan to perform the same experiments on CloudLab soon.&lt;/p>
&lt;h2 id="profiling">Profiling&lt;/h2>
&lt;p>On running the load tests with a profiler, we found that the bottlenecks in the performance arose not from the Wagtail codebase but from the Django codebase. In particular, we identified three modules in Django that consumed the most time during the load tests: &lt;code>django.db.backends.sqlite3._functions&lt;/code>, &lt;code>django.utils.functional&lt;/code>, and &lt;code>django.views.debug&lt;/code>. &lt;a href="https://github.com/dibrinsofor" target="_blank" rel="noopener">Dibri&lt;/a>, a graduate student in Ben&amp;rsquo;s lab, is helping us add types to these modules.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>Based on these findings, we are now working on typing these modules to see if we can improve the performance of the application by using Static Python. Typing Django is a non-trivial task, and while there have been some efforts to do so, previous attempts like &lt;a href="https://github.com/typeddjango/django-stubs" target="_blank" rel="noopener">django-stubs&lt;/a> are incomplete for our purpose.&lt;/p>
&lt;p>We are also writing scripts to mix untyped, shallow-typed, and advanced-typed versions of a Python file, and run each mixed version several times to obtain a narrow confidence interval for the performance of each version.&lt;/p>
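&lt;p>The confidence-interval computation can be sketched with the standard library. This is a minimal illustration using a normal approximation, and the sample timings below are made up:&lt;/p>

```python
import statistics

def confidence_interval(samples, z=1.96):
    """Return (mean, half_width) for an approximate 95% CI.

    Uses the normal approximation; with very few runs a
    t-distribution multiplier would be more appropriate.
    """
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / len(samples) ** 0.5
    return mean, half_width

# Hypothetical wall-clock times (seconds) from repeated runs of one variant.
times = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
mean, half = confidence_interval(times)
```

&lt;p>Repeating runs until the half-width is small relative to the mean is what lets us compare typed and untyped variants with confidence.&lt;/p>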
&lt;p>We will be posting more updates as we make progress. Thank you for reading!&lt;/p></description></item></channel></rss>