Type Narrowing: A Language Design Benchmark

Sat, 01 Feb 2025 00:00:00 +0000

Untyped languages such as JavaScript and Python provide a flexible starting point for software projects, but eventually, the lack of reliable types makes code hard to debug and maintain. Gradually typed languages such as TypeScript, Flow, Mypy, and Pyright address the problem with type checkers that can reason about an ever-growing subset of untyped code. Widening the subset with precise types is an ongoing challenge.

Furthermore, designs for precise gradual types need to be reproducible across languages. Ideas that work well in one language need to be validated in other contexts in a principled, scientific way to separate deep insights from language-specific hacks.

Type narrowing is a key feature of gradual languages. Narrowing uses type tests in code to refine types and push information forward along the paths that the program may follow. For example, when a type test checks an object field, later code can trust the type of the field:

// item :: JSON Object
if typeof(item["price"]) == "number":
 // item :: JSON Object,
 // where field "price" :: Number
 return item["price"] + (item["price"] * 0.30) // add tax
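
In Python, the same idea can be sketched with an isinstance test. This is a minimal illustration rather than part of the benchmark: the function name price_with_tax and the dict-based item are assumptions, and the field is copied into a local variable first because not every checker narrows subscript expressions such as item["price"].

```python
def price_with_tax(item: dict[str, object]) -> float:
    """Return the price plus 30% tax, or raise if the field is not numeric."""
    price = item.get("price")  # price :: object (None if the field is missing)
    if isinstance(price, (int, float)):
        # price :: int | float in this branch, so arithmetic type-checks
        return price + (price * 0.30)  # add tax
    raise TypeError("item['price'] is not a number")
```

Outside the if branch, price keeps its original, imprecise type, so a checker should reject arithmetic on it there.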

Nearly every gradual language agrees that some form of type narrowing is needed, but there is widespread disagreement about how much support is enough. TypeScript lets users define custom type tests, but it does not analyze those tests to see whether they are reliable. Flow does analyze tests. TypeScript does not allow asymmetric type tests (example: is_even_number), but Flow, Mypy, and Pyright all do! None of the above track information compositionally through program execution, but another gradual language called Typed Racket does. Is the extra machinery in Typed Racket really worth the effort?

Over the past several months, we have curated a language design benchmark for type narrowing, If-T:

https://github.com/utahplt/ift-benchmark

The benchmark presents type system challenges in a language-agnostic way to facilitate reproducibility across languages. It also includes a datasheet to encourage cross-language comparisons that focus on fundamental typing features rather than incidental differences between languages. So far, we have implemented the benchmark for five gradual languages. There are many others to explore, and much more to learn.

The goal of this project is to replicate and extend the If-T type narrowing benchmark. Outcomes include a deep understanding of principled type narrowing, and of how to construct a benchmark that enables reproducible cross-language comparisons.

Related Work:

Type Narrowing in TypeScript https://www.typescriptlang.org/docs/handbook/2/narrowing.html
Type Narrowing in Python https://typing.readthedocs.io/en/latest/spec/narrowing.html#typeguard
Logical Types for Untyped Languages https://doi.org/10.1145/1863543.1863561

Evaluate New Gradual Languages

Topics: benchmark implementation, programming languages, types
Skills: Ruby, Lua, Python, Clojure, or PHP
Difficulty: Medium
Size: Small
Mentor: Ben Greenman

Bring the If-T Benchmark to new typecheckers. Examples include Sorbet, Hack, Luau, Pyre, Cinder / Static Python, Typed Clojure, and (potentially) Elixir. Conduct a scientific, cross-language analysis to discuss the implications of benchmark results.

Do Unsound Narrowings Lead to Exploits?

Topics: corpus study, types, counterexamples
Skills: TypeScript or Python
Difficulty: Medium
Size: Small
Mentor: Ben Greenman

Investigate type narrowing in practice through a corpus study of software projects. Use the GitHub or Software Heritage APIs to search code for user-defined predicates and other instances of narrowing. Search for vulnerabilities due to the unsound typing of user-defined predicates.

Static Python Perf: Measuring the Cost of Sound Gradual Types

Sat, 06 Jan 2024 00:00:00 +0000

Gradual typing is a solution to the longstanding tension between typed and untyped languages: let programmers write code in any flexible language (such as Python), equip the language with a suitable type system that can describe invariants in part of a program, and use run-time checks to ensure soundness.
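
As a rough illustration of what a run-time check at the boundary between typed and untyped code can look like, consider the sketch below. It is a generic example written for this explanation and is not how Static Python implements its checks; the checked decorator and the scale function are hypothetical, and only plain class annotations such as int or str are enforced.

```python
import functools
import inspect
from typing import get_type_hints


def checked(fn):
    """Check arguments against simple class annotations on every call."""
    hints = get_type_hints(fn)
    sig = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked in this sketch.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name}: expected {expected.__name__}, got {type(value).__name__}"
                )
        return fn(*args, **kwargs)

    return wrapper


@checked
def scale(n: int, factor: int) -> int:
    return n * factor
```

Without the check, an untyped caller could run scale("2", 3) and silently receive the string "222" from a function declared to return int. With the check, the call raises TypeError, at the price of extra work on every call.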

For now, though, the cost of run-time checks can be enormous. Order-of-magnitude slowdowns are common. This high cost is a main reason why TypeScript is unsound by design: it gives up trustworthy types in order to avoid run-time costs.

Recently, a team at Meta built a gradually-typed variant of Python called (drumroll) Static Python. They report an incredible 4% increase in CPU efficiency at Instagram thanks to the sound types in Static Python. This kind of speedup is unprecedented.

Other languages may want to follow the Static Python approach to gradual types, but there are big reasons to doubt the Instagram numbers:

the experiment code is closed source, and
the experiment itself is not easily reproducible (even for Instagram!).

Static Python needs a rigorous, reproducible performance evaluation to test whether it is indeed a fundamental advance for gradual typing.

Related Work:

Gradual Soundness: Lessons from Static Python https://programming-journal.org/2023/7/2/
Producing Wrong Data Without Doing Anything Obviously Wrong! https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
On the Cost of Type-Tag Soundness https://users.cs.utah.edu/~blg/resources/pdf/gm-pepm-2018.pdf

Design and Run an Experiment

Topics: performance, cluster computing, statistics
Skills: Python AST parsing, program generation, scripting, measuring performance
Difficulty: Medium
Size: Medium (175 hours)
Mentor: Ben Greenman

Design an experiment that covers the space of gradually-typed Static Python programs in a fair way. Since every variable in a program can have up to 3 different types, there are easily 3^20 possibilities in small programs — far too many to measure exhaustively.
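
To put the number in perspective: 3^20 = 3,486,784,401 configurations, so even at an assumed one second per measurement an exhaustive run would take over a century. One common alternative is to measure a random sample. The sketch below is only one possible design, with assumed constants (3 type choices, 20 variables) and a hypothetical helper sample_configurations that draws configurations uniformly with replacement.

```python
import random

TYPE_LEVELS = 3     # assumed number of type choices per variable
NUM_VARIABLES = 20  # assumed number of variables in a small program

total = TYPE_LEVELS ** NUM_VARIABLES  # 3486784401


def sample_configurations(num_samples: int, seed: int = 0) -> list[tuple[int, ...]]:
    """Draw configurations uniformly at random, with replacement.

    Each configuration assigns one of TYPE_LEVELS choices to every variable.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [
        tuple(rng.randrange(TYPE_LEVELS) for _ in range(NUM_VARIABLES))
        for _ in range(num_samples)
    ]
```

A fixed seed keeps the sample reproducible across cluster reservations. Whether uniform sampling is a fair way to cover the space is itself one of the design questions for the experiment.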

Run the experiment on an existing set of benchmarks using a cluster such as CloudLab. Manage the cluster machines across potentially dozens of reservations and combine the results into one comprehensive view of Static Python performance.

Derive Benchmarks from Python Applications

Topics: types, optimization, benchmark design
Skills: Python
Difficulty: Medium
Size: Small to Large
Mentor: Ben Greenman

Build or find realistic Python applications, equip them with rich types, and modify them to run a meaningful performance benchmark. Running a benchmark should produce timing information, and the timing should not be significantly influenced by random variables, I/O actions, or system events.
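
The requirements above can be sketched as a small harness. Everything here is illustrative: make_input and workload are hypothetical stand-ins for a real application, and the choice of statistics is an assumption rather than a project requirement. The input is generated from a fixed seed outside the timed region, and the timed code performs no I/O.

```python
import random
import statistics
import time


def make_input(seed: int, n: int = 10_000) -> list[int]:
    """Build a deterministic input so every run does the same work."""
    rng = random.Random(seed)
    return [rng.randrange(1_000_000) for _ in range(n)]


def workload(data: list[int]) -> int:
    """Hypothetical stand-in for an application kernel: pure computation."""
    return sum(sorted(data)[: len(data) // 2])


def run_benchmark(iterations: int = 5, seed: int = 42) -> dict[str, float]:
    """Time the workload several times and summarize the results in seconds."""
    data = make_input(seed)  # input setup is excluded from the timings
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload(data)
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "stdev": statistics.stdev(timings),  # needs at least two iterations
    }
```

Reporting the spread alongside the median makes it easier to notice when system events have disturbed a run.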

Ben Greenman | UCSC OSPO