Publications


Saba: Rethinking Datacenter Network Allocation from Application’s Perspective

MR Siavash Katebzadeh, Paolo Costa, and Boris Grot

In Eighteenth European Conference on Computer Systems (EuroSys), 2023

abstract

PDF

Today’s datacenter workloads increasingly comprise distributed data-intensive applications, including data analytics, graph processing, and machine-learning training. These applications are bandwidth-hungry and often congest the datacenter network, resulting in poor network performance, which hurts application completion time. Efforts made to address this problem generally aim to achieve max-min fairness at the flow or application level. We observe that splitting the bandwidth equally among workloads is sub-optimal for aggregate application-level performance because various workloads exhibit different sensitivity to network bandwidth: for some workloads, even a small reduction in the available bandwidth yields a significant increase in completion time; for others, the completion time is largely insensitive to the available bandwidth. Building on this insight, we propose Saba, an application-aware bandwidth allocation framework that distributes network bandwidth based on application-level sensitivity. Saba combines ahead-of-time application profiling to determine bandwidth sensitivity with runtime bandwidth allocation using lightweight software support with no modifications to network hardware or protocols. Experiments with a 32-server hardware testbed show that Saba improves average completion time by 1.88× (and by 1.27× in a simulated 1,944-server cluster).
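
To make the allocation idea concrete, below is a minimal Python sketch of sensitivity-weighted bandwidth sharing. The function name, the linear weighting, and the example workloads are illustrative assumptions for this page, not Saba's actual implementation.

# Minimal sketch of sensitivity-weighted bandwidth allocation (illustrative only;
# the function name and the linear weighting are assumptions, not Saba's code).

def allocate_bandwidth(link_capacity_gbps, apps):
    """Split link bandwidth in proportion to each app's profiled sensitivity.

    `apps` maps an application name to a sensitivity weight obtained from
    ahead-of-time profiling (higher weight = completion time degrades more
    sharply when bandwidth is reduced).
    """
    total_weight = sum(apps.values())
    return {
        name: link_capacity_gbps * weight / total_weight
        for name, weight in apps.items()
    }

# Example: a bandwidth-sensitive training job vs. a largely insensitive batch job.
profiled = {"ml_training": 0.8, "batch_analytics": 0.2}
print(allocate_bandwidth(100, profiled))
# {'ml_training': 80.0, 'batch_analytics': 20.0}  -- vs. a 50/50 max-min split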


Hermes: A fast, fault-tolerant and linearizable replication protocol

Antonios Katsarakis, Vasilis Gavrielatos, MR Siavash Katebzadeh, and 4 more authors

In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020

abstract

PDF

Today’s datacenter applications are underpinned by datastores that are responsible for providing availability, consistency, and performance. For high availability in the presence of failures, these datastores replicate data across several nodes. This is accomplished with the help of a reliable replication protocol that is responsible for maintaining the replicas strongly-consistent even when faults occur. Strong consistency is preferred to weaker consistency models that cannot guarantee an intuitive behavior for the clients. Furthermore, to accommodate high demand at real-time latencies, datastores must deliver high throughput and low latency. This work introduces Hermes, a broadcast-based reliable replication protocol for in-memory datastores that provides both high throughput and low latency by enabling local reads and fully-concurrent fast writes at all replicas. Hermes couples logical timestamps with cache-coherence-inspired invalidations to guarantee linearizability, avoid write serialization at a centralized ordering point, resolve write conflicts locally at each replica (hence ensuring that writes never abort) and provide fault-tolerance via replayable writes. Our implementation of Hermes over an RDMA-enabled reliable datastore with five replicas shows that Hermes consistently achieves higher throughput than state-of-the-art RDMA-based reliable protocols (ZAB and CRAQ) across all write ratios while also significantly reducing tail latency. At 5% writes, the tail latency of Hermes is 3.6× lower than that of CRAQ and ZAB.
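
The write path described above (logical timestamps plus cache-coherence-style invalidations) can be sketched as follows. This is a single-process Python toy with in-memory "message" delivery; networking, failures, and write replays are omitted, and the class and method names are assumptions rather than names from the Hermes codebase.

# Toy sketch of Hermes-style invalidation-based writes (illustrative only).

from dataclasses import dataclass

@dataclass
class Entry:
    value: object = None
    ts: tuple = (0, 0)        # (logical_clock, node_id) -- higher tuple wins
    state: str = "VALID"      # VALID or INVALID

class Replica:
    def __init__(self, node_id):
        self.node_id = node_id
        self.clock = 0
        self.store = {}

    def local_read(self, key):
        e = self.store.get(key)
        # Local reads are served only while the entry is VALID.
        if e and e.state == "VALID":
            return e.value
        return None  # a real replica would stall/retry until validated

    def recv_inv(self, key, ts, value):
        e = self.store.setdefault(key, Entry())
        # Conflicts resolve locally: the higher timestamp wins, so writes never abort.
        if ts > e.ts:
            e.value, e.ts, e.state = value, ts, "INVALID"

    def recv_val(self, key, ts):
        e = self.store.get(key)
        if e and e.ts == ts:
            e.state = "VALID"

def write(coordinator, replicas, key, value):
    # 1) the coordinating replica picks a unique logical timestamp
    coordinator.clock += 1
    ts = (coordinator.clock, coordinator.node_id)
    # 2) broadcast invalidations and (implicitly) gather acks from all replicas
    for r in replicas:
        r.recv_inv(key, ts, value)
    # 3) broadcast validations so every replica can serve local reads again
    for r in replicas:
        r.recv_val(key, ts)

replicas = [Replica(i) for i in range(3)]
write(replicas[0], replicas, "x", 42)
print([r.local_read("x") for r in replicas])   # [42, 42, 42]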


Evaluation of an InfiniBand Switch: Choose Latency or Bandwidth, but Not Both

MR Siavash Katebzadeh, Paolo Costa, and Boris Grot

In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2020

abstract

PDF

Today’s cloud datacenters feature a large number of concurrently executing applications with diverse intra-datacenter latency and bandwidth requirements. To remove the network as a potential performance bottleneck, datacenter operators have begun deploying high-end HPC-grade networks, such as InfiniBand (IB), which offer fully offloaded network stacks, remote direct memory access (RDMA) capability, and non-discarding links. While known to provide both low latency and high bandwidth for a single application, it is not clear how well such networks accommodate a mix of latency- and bandwidth-sensitive traffic that is likely in a real-world deployment. As a step toward answering this question, we develop a performance measurement tool for RDMA-based networks, RPerf, that is capable of precisely measuring the IB switch performance without hardware support. Using RPerf, we benchmark a rack-scale IB cluster in isolated and mixed-traffic scenarios. Our key finding is that the evaluated switch can provide either low latency or high bandwidth, but not both simultaneously in a mixed-traffic scenario. We evaluate several options to improve the latency-bandwidth trade-off and demonstrate that none are ideal.
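
At its core, a mixed-traffic measurement runs a small-message latency probe alongside a bandwidth-hungry bulk flow and reports tail latency. The sketch below illustrates that structure using TCP sockets as a stand-in for the RDMA verbs that RPerf actually uses; the hostnames, ports, message sizes, and the assumed echo server on the far side are hypothetical and not part of RPerf.

# Illustrative mixed-traffic probe: TCP sockets stand in for RDMA operations.

import socket, statistics, threading, time

PROBE_SIZE = 64          # small, latency-sensitive messages
BULK_SIZE = 1 << 20      # large, bandwidth-hungry messages

def bulk_sender(host, port, duration_s):
    """Saturate the link with large writes (the bandwidth-sensitive flow)."""
    payload = b"\0" * BULK_SIZE
    with socket.create_connection((host, port)) as s:
        end = time.time() + duration_s
        while time.time() < end:
            s.sendall(payload)

def latency_probe(host, port, n_samples):
    """Ping-pong small messages against an echo server and record round-trip times."""
    rtts = []
    with socket.create_connection((host, port)) as s:
        for _ in range(n_samples):
            t0 = time.perf_counter()
            s.sendall(b"\0" * PROBE_SIZE)
            got = 0
            while got < PROBE_SIZE:            # read back the full echoed message
                got += len(s.recv(PROBE_SIZE - got))
            rtts.append(time.perf_counter() - t0)
    rtts.sort()
    return {
        "median_us": 1e6 * statistics.median(rtts),
        "p99_us": 1e6 * rtts[int(0.99 * len(rtts))],
    }

# Run the bulk flow and the latency probe concurrently to observe how background
# bandwidth traffic inflates probe tail latency through the switch.
threading.Thread(target=bulk_sender, args=("server", 9000, 10), daemon=True).start()
print(latency_probe("server", 9001, 10000))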


Bankrupt covert channel: Turning network predictability into vulnerability

Dmitrii Ustiugov, Plamen Petrov, MR Siavash Katebzadeh, and 1 more author

In 14th USENIX Workshop on Offensive Technologies (WOOT), 2020

abstract

PDF

Recent years have seen a surge in the number of data leaks despite aggressive information-containment measures deployed by cloud providers. When attackers acquire sensitive data in a secure cloud environment, covert communication channels are a key tool to exfiltrate the data to the outside world. While the bulk of prior work focused on covert channels within a single CPU, they require the spy (transmitter) and the receiver to share the CPU, which might be difficult to achieve in a cloud environment with hundreds or thousands of machines. This work presents Bankrupt, a high-rate, highly clandestine channel that enables covert communication between the spy and the receiver running on different nodes in an RDMA network. In Bankrupt, the spy communicates with the receiver by issuing RDMA network packets to a private memory region allocated to it on a different machine (an intermediary). The receiver similarly allocates a separate memory region on the same intermediary, also accessed via RDMA. By steering RDMA packets to a specific set of remote memory addresses, the spy causes deep queuing at one memory bank, which is the finest addressable internal unit of main memory. This exposes a timing channel that the receiver can listen on by issuing probe packets to addresses mapped to the same bank but in its own private memory region. The Bankrupt channel delivers 74 Kb/s throughput in CloudLab’s public cloud while remaining undetectable to the existing monitoring capabilities, such as CPU and NIC performance counters.
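
Conceptually, the channel encodes one bit per time slot: the spy floods addresses mapped to the shared bank to signal a 1 and stays silent for a 0, while the receiver thresholds the latency of probes to its own addresses in that bank. The Python sketch below captures that encode/decode loop; flood_bank, probe_bank_latency, the slot length, and the threshold are placeholder assumptions for illustration, not the paper's parameters or API.

# Conceptual sketch of a bank-contention timing channel (illustrative only).
# flood_bank() and probe_bank_latency() are placeholders for timed RDMA
# writes/reads to addresses that map to the shared DRAM bank.

import statistics, time

SLOT_S = 0.001            # one bit per time slot (made-up duration)
THRESHOLD_S = 20e-6       # probe latency above this => contended bank => bit 1

def transmit(bits, flood_bank):
    """Spy: create bank contention during a slot to signal '1', stay idle for '0'."""
    for bit in bits:
        end = time.perf_counter() + SLOT_S
        if bit:
            # '1': hammer the shared bank for the whole slot to create queuing delay
            while time.perf_counter() < end:
                flood_bank()
        else:
            # '0': stay silent so the receiver's probes see an uncontended bank
            time.sleep(SLOT_S)

def receive(n_bits, probe_bank_latency):
    """Receiver: probe its own addresses in the same bank and threshold the latency."""
    bits = []
    for _ in range(n_bits):
        end = time.perf_counter() + SLOT_S
        samples = []
        while time.perf_counter() < end:
            samples.append(probe_bank_latency())   # timed access mapped to the bank
        bits.append(1 if statistics.median(samples) > THRESHOLD_S else 0)
    return bits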