Stanford Computer Systems Reading Group

Fridays at 11 AM, Gates 392

Snap: a Microkernel Approach to Host Networking

Authors: Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean Bauer, Carlo Contavalli, Michael Dalton, Nandita Dukkipati, William C. Evans, Steve Gribble, Nicholas Kidd, Roman Kononov, Gautam Kumar, Carl Mauer, Emily Musick, Lena Olson, Erik Rubow, Michael Ryan, Kevin Springborn, Paul Turner, Valas Valancius, Xi Wang, and Amin Vahdat, Google, Inc. Abstract: This paper presents our design and experience with a microkernel-inspired approach to host networking called Snap. Snap is a userspace networking system that supports Google’s rapidly evolving needs with flexible modules that implement a range of network functions, including edge packet switching, virtualization for our cloud platform, traffic shaping policy enforcement, and a high-performance reliable messaging and RDMA-like service....

Carbink: Fault-Tolerant Far Memory

Authors: Yang Zhou, Harvard University; Hassan M. G. Wassel, Google; Sihang Liu, University of Virginia; Jiaqi Gao and James Mickens, Harvard University; Minlan Yu, Harvard University and Google; Chris Kennelly, Paul Turner, and David E. Culler, Google; Henry M. Levy, University of Washington and Google; Amin Vahdat, Google Abstract: Far memory systems allow an application to transparently access local memory as well as memory belonging to remote machines. Fault tolerance is a critical property of any practical approach for far memory, since machine failures (both planned and unplanned) are endemic in datacenters....

CausalSim: A Causal Framework for Unbiased Trace-Driven Simulation

Authors: Abdullah Alomar, Pouya Hamadanian, Arash Nasr-Esfahany, Anish Agarwal, Mohammad Alizadeh, and Devavrat Shah, MIT Abstract: We present CausalSim, a causal framework for unbiased trace-driven simulation. Current trace-driven simulators assume that the interventions being simulated (e.g., a new algorithm) would not affect the validity of the traces. However, real-world traces are often biased by the choices algorithms make during trace collection, and hence replaying traces under an intervention may lead to incorrect results....

BuildIt: A Type-Based Multi-stage Programming Framework for Code Generation in C++

Authors: Ajay Brahmakshatriya and Saman Amarasinghe, CSAIL, MIT Abstract: The simplest implementation of a domain-specific language is to embed it in an existing language using operator overloading. This way, the DSL can inherit parsing, syntax and type checking, error handling, and the toolchain of debuggers and IDEs from the host language. A natural host language choice for most high-performance DSLs is the de-facto high-performance language, C++. However, DSL designers quickly run into the problem of not being able to extract control flows due to a lack of introspection in C++ and have to resort to special functions with lambdas to represent loops and conditionals....

Finding Typing Compiler Bugs

Authors: Stefanos Chaliasos, Imperial College London; Thodoris Sotiropoulos, Athens University of Economics and Business; Diomidis Spinellis, Athens University of Economics and Business and Delft University of Technology, the Netherlands; Arthur Gervais and Benjamin Livshits, Imperial College London; Dimitris Mitropoulos, University of Athens Abstract: We propose a testing framework for validating static typing procedures in compilers. Our core component is a program generator suitably crafted for producing programs that are likely to trigger typing compiler bugs....

Understanding and Exploiting Optimal Function Inlining

Authors: Theodoros Theodoridis, ETH Zurich; Tobias Grosser, University of Edinburgh; Zhendong Su, ETH Zurich Abstract: Inlining is a core transformation in optimizing compilers. It replaces a function call (call site) with the body of the called function (callee). It helps reduce function call overhead and binary size, and more importantly, enables other optimizations. The problem of inlining has been extensively studied, but it is far from being solved; predicting which inlining decisions are beneficial is nontrivial due to interactions with the rest of the compiler pipeline....

Graham: Synchronizing Clocks by Leveraging Local Clock Properties

Authors: Ali Najafi, Meta; Michael Wei, VMware Research Abstract: High-performance, strongly consistent applications are beginning to require scalable sub-microsecond clock synchronization. State-of-the-art clock synchronization focuses on improving accuracy or frequency of synchronization, ignoring the properties of the local clock: loss of connectivity to the remote clock means synchronization failure. Our system, Graham, leverages the fact that the local clock still keeps time even when connectivity is lost and builds a failure model using the characteristics of the local clock and the desired synchronization accuracy....

Orca: A Distributed Serving System for Transformer-Based Generative Models

Authors: Gyeong-In Yu and Joo Seong Jeong, Seoul National University; Geon-Woo Kim, FriendliAI and Seoul National University; Soojeong Kim, FriendliAI; Byung-Gon Chun, FriendliAI and Seoul National University Abstract: Large-scale Transformer-based models trained for generation tasks (e.g., GPT-3) have recently attracted huge interest, emphasizing the need for system support for serving models in this family. Since these models generate a next token in an autoregressive manner, one has to run the model multiple times to process an inference request where each iteration of the model generates a single output token for the request....

Scalability! But at what COST?

Authors: Frank McSherry; Michael Isard; Derek G. Murray Abstract: We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads....
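The COST definition in this abstract can be illustrated with a minimal sketch. It assumes, as a simplification, that a "hardware configuration" is just a core count; the function name `cost` and all timings below are hypothetical and are not measurements from the paper.

```python
from typing import Dict, Optional


def cost(single_thread_seconds: float,
         system_seconds_by_cores: Dict[int, float]) -> Optional[int]:
    """Return the smallest core count at which the system beats the
    single-threaded baseline, or None if it never does (unbounded COST).

    Assumption: a configuration is identified only by its core count.
    """
    for cores in sorted(system_seconds_by_cores):
        if system_seconds_by_cores[cores] < single_thread_seconds:
            return cores
    return None


# Hypothetical runtimes in seconds, for illustration only.
baseline = 300.0
scaling = {1: 1200.0, 4: 650.0, 16: 320.0, 64: 140.0, 128: 90.0}

print(cost(baseline, scaling))
```

With these made-up numbers the system scales well from 1 to 128 cores, yet it only overtakes the single-threaded baseline at 64 cores, which is the kind of gap between scalability and real gains that the metric is meant to expose.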

XRP: In-Kernel Storage Functions with eBPF

Authors: Yuhong Zhong, Haoyu Li, Yu Jian Wu, Ioannis Zarkadas, Jeffrey Tao, Evan Mesterhazy, Michael Makris, and Junfeng Yang, Columbia University; Amy Tai, Google; Ryan Stutsman, University of Utah; Asaf Cidon, Columbia University Abstract: With the emergence of microsecond-scale NVMe storage devices, the Linux kernel storage stack overhead has become significant, almost doubling access times. We present XRP, a framework that allows applications to execute user-defined storage functions, such as index lookups or aggregations, from an eBPF hook in the NVMe driver, safely bypassing most of the kernel’s storage stack....