eRDMA

Introduction

In today's AI landscape, networking is crucial. Centralized computing networks use RDMA (Remote Direct Memory Access) for low-latency, high-bandwidth data transmission, supporting distributed training, data center interconnection, and GPU acceleration. However, decentralized networks need even stronger networking solutions to ensure seamless AI operations. That's why we've developed eRDMA (Enhanced Remote Direct Memory Access), designed to overcome public-network interconnection challenges and create an AI computing network spanning hundreds of kilometers.

Traditional communication libraries like Gloo rely on TCP, which, while reliable, introduces significant latency and consumes substantial resources. TCP's performance issues become particularly pronounced in high-latency, high-packet-loss, or complex network environments. Furthermore, TCP/IP struggles with Network Address Translation (NAT) and firewall restrictions, limiting the scalability and flexibility of distributed systems.

eRDMA addresses these issues by enhancing transmission efficiency, optimizing protocol performance, and improving network adaptability. Our design features four key innovations:

  1. Integration of QUIC Protocol: We replaced the underlying communication protocol with the UDP-based QUIC protocol. QUIC not only inherits the reliability and congestion control mechanisms of TCP but also offers lower latency and higher transmission efficiency, making it particularly well-suited for use in unstable network environments.
  2. Incorporation of P2P Technology: By integrating Peer-to-Peer (P2P) technology, we enhanced the library's ability to penetrate NATs, enabling direct communication between nodes even in complex network environments, thereby overcoming the limitations of traditional communication methods.
  3. DPDK Optimization: To further improve protocol stack performance, we utilized the Data Plane Development Kit (DPDK) to optimize the communication library, leveraging efficient memory management and data plane acceleration to significantly boost data processing efficiency.
  4. GPU-Accelerated Compression: To reduce data transmission volume and improve bandwidth utilization, we introduced GPU-accelerated compression, leveraging the parallel processing power of GPUs to efficiently compress data before transmission, thereby enhancing overall transmission efficiency.

The primary contribution of this work is a new collective communications library that combines QUIC, P2P, DPDK, and GPU-accelerated compression to achieve a highly efficient, low-latency, and robust communication solution for distributed systems. Experimental results demonstrate that this approach offers significant advantages in terms of transmission efficiency, network adaptability, and system performance. Through these improvements, we aim to provide a more resilient and efficient communication foundation for future distributed computing systems.

Design and Implementation

This section provides a detailed description of the design and implementation of our novel collective communications library, highlighting how QUIC, P2P technology, DPDK, and GPU-accelerated compression are integrated to achieve efficient UDP transmission, NAT traversal, protocol optimization, and data compression.

1. Implementation of the QUIC Protocol

1.1 Advantages of the QUIC Protocol

QUIC (Quick UDP Internet Connections) is a transport-layer protocol originally developed by Google and later standardized by the IETF (RFC 9000), designed to improve network transmission efficiency and reduce latency. Compared to traditional TCP, QUIC offers several key advantages:

Low-latency Connection Establishment: QUIC combines the transport and TLS handshakes, allowing connection establishment within a single round-trip time (RTT), and zero RTT for resumed connections, significantly reducing initial latency.

Multiplexing: QUIC supports multiple independent data streams within a single connection, avoiding TCP's head-of-line blocking across streams and thereby improving transmission efficiency.

Built-in Encryption: QUIC encrypts traffic by default, with TLS 1.3 integrated into the handshake, simplifying security configuration and strengthening transport security.

1.2 Application of QUIC in Collective Communications

We replaced the TCP-based transmission module in Gloo with a QUIC-based transmission module to leverage QUIC's advantages in low-latency and efficient transmission. The implementation involves the following steps:

Protocol Integration: We integrated the QUIC protocol stack into the collective communications library as the underlying transport protocol. We utilized an open-source QUIC implementation (such as quic-go) and made necessary customizations to meet the requirements of collective communication.

Reliability Handling: Although QUIC is fundamentally based on UDP, it incorporates TCP-like reliability mechanisms (such as ACKs and retransmissions). In our implementation, we ensured that critical data flows (such as synchronization operations and parameter updates) could be reliably transmitted using QUIC.

Multiplexing Optimization: In the design of the collective communications library, we fully utilized QUIC's multiplexing capabilities to reduce network overhead and latency. Each communication node can transmit multiple data streams simultaneously over a single QUIC connection, thereby improving overall communication efficiency.
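To make this mapping concrete, below is a minimal C++ sketch of how a Gloo-style point-to-point pair can frame each message on its own QUIC stream. The `QuicConnection` and `QuicStream` types are hypothetical stand-ins for whichever QUIC stack is linked in (e.g. a quic-go sidecar or a native C/C++ implementation); they are stubbed here only so the sketch is self-contained.

```cpp
#include <cstdint>
#include <cstring>
#include <memory>

// Hypothetical thin wrappers over the underlying QUIC stack; real code
// would delegate to the linked implementation. Stubbed for self-containment.
struct QuicStream {
  void Write(const void* buf, size_t len) { (void)buf; (void)len; }
};

struct QuicConnection {
  // One connection multiplexes many independent streams, so concurrent
  // tensors never block one another at the transport level.
  std::unique_ptr<QuicStream> OpenStream() {
    return std::make_unique<QuicStream>();
  }
};

// A Gloo-style "pair": one logical point-to-point channel between ranks.
// Each in-flight operation gets its own QUIC stream, which is how the
// library exploits multiplexing to avoid head-of-line blocking.
class QuicPair {
 public:
  explicit QuicPair(std::shared_ptr<QuicConnection> conn)
      : conn_(std::move(conn)) {}

  // Frame one message (e.g. a tensor slice) as [length | payload] on a
  // fresh stream. QUIC's ACK/retransmission machinery makes each stream
  // reliable, so synchronization and parameter updates need no extra
  // application-level retry logic.
  void Send(const void* data, uint64_t len) {
    auto stream = conn_->OpenStream();
    stream->Write(&len, sizeof(len));
    stream->Write(data, static_cast<size_t>(len));
  }

 private:
  std::shared_ptr<QuicConnection> conn_;
};
```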

2. P2P Technology Integration

2.1 Overview of the P2P Network Model

Peer-to-Peer (P2P) is a distributed network architecture that allows nodes to communicate directly with each other without relying on a central server. P2P technology offers significant advantages in NAT traversal, resource sharing, and adaptability to dynamic network topologies.

2.2 Application of P2P Technology for NAT Traversal

To enable the collective communications library to operate seamlessly in complex network environments (such as behind NATs or firewalls), we integrated P2P technology. The implementation includes the following steps:

Node Discovery and Bootstrapping: Using technologies such as a Distributed Hash Table (DHT), we enabled automatic node discovery and bootstrapping. When a node joins the network, it uses the DHT to find reachable peers and establish initial connections.

NAT Traversal: We combined STUN (Session Traversal Utilities for NAT) and TURN (Traversal Using Relays around NAT) protocols to assist nodes in traversing NAT devices and achieving P2P connectivity. The STUN protocol is used to detect the type of NAT and attempt direct connections, while the TURN protocol acts as a fallback option to relay data.

Connection Management: To handle the potential fluctuations in a P2P network, we designed a robust connection management mechanism that monitors the status of connections and automatically retries or switches to alternate paths if a connection is lost.
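The C++ sketch below illustrates the path-selection logic these three steps describe. The helpers (`StunResolveReflexive`, `TryHolePunch`, `TurnAllocateRelay`) are hypothetical stand-ins for a real STUN/TURN client and are stubbed for self-containment; the retry count and backoff values are illustrative, not the library's actual tuning.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>
#include <string>
#include <thread>

struct Endpoint { std::string ip; uint16_t port; };

// Hypothetical stubs; real code would call into a STUN/TURN client library
// and the DHT layer described above.
std::optional<Endpoint> StunResolveReflexive() { return std::nullopt; }
std::optional<Endpoint> TryHolePunch(const Endpoint&) { return std::nullopt; }
Endpoint TurnAllocateRelay() { return {"relay.example", 3478}; }

// Connection strategy from Section 2.2: learn our NAT-mapped address via
// STUN, attempt a direct hole-punched path with backoff, and fall back to
// a TURN relay when the NAT defeats hole punching.
Endpoint EstablishPath(const Endpoint& peer) {
  // The server-reflexive address is published through the DHT so the peer
  // can aim its hole-punch packets at us.
  std::optional<Endpoint> reflexive = StunResolveReflexive();
  (void)reflexive;

  for (int attempt = 0; attempt < 3; ++attempt) {
    if (auto direct = TryHolePunch(peer)) return *direct;  // lowest latency
    std::this_thread::sleep_for(std::chrono::milliseconds(200 << attempt));
  }
  // Relayed traffic costs extra latency and bandwidth, but it works behind
  // symmetric NATs where direct connections are impossible. The connection
  // manager re-runs this routine whenever an established path drops.
  return TurnAllocateRelay();
}
```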

3. DPDK Optimization

3.1 Overview of DPDK

The Data Plane Development Kit (DPDK) is a set of libraries and drivers for accelerating data plane processing, widely used in high-performance networking applications. DPDK achieves efficient packet processing through techniques such as zero-copy, batch processing, and memory pool management, significantly enhancing the performance of the network protocol stack.

3.2 DPDK-based Protocol Optimization

In our project, we utilized DPDK to optimize the network protocol stack of the collective communications library. The implementation steps include:

Zero-copy Mechanism: By leveraging DPDK's zero-copy mechanism, data packets are processed directly in user space, reducing the context switching and data copying between kernel space and user space, thereby lowering transmission latency and CPU consumption.

Batch Processing: DPDK supports batch processing of data packets, which helps reduce processing overhead and increase throughput. In our implementation, packet send and receive operations were batched to further enhance network processing efficiency.

Memory Management: DPDK provides efficient memory pool management, optimizing the process of memory allocation and release. We used DPDK's memory pools to manage network buffers, reducing memory fragmentation and improving memory utilization.

Hardware Acceleration: By utilizing DPDK's tight integration with network hardware, we enabled hardware acceleration features such as Receive Side Scaling (RSS) and offloading, further optimizing the efficiency of network data processing.
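As a rough illustration of this receive/transmit path, the sketch below uses the standard DPDK burst APIs (`rte_eth_rx_burst` / `rte_eth_tx_burst`) with a preallocated mbuf pool. Device and queue configuration are omitted for brevity, and the single port and burst size are illustrative assumptions rather than our production settings.

```cpp
#include <cstdlib>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

int main(int argc, char** argv) {
  // EAL init: hugepages, lcore pinning, PMD probing (kernel bypass).
  if (rte_eal_init(argc, argv) < 0)
    rte_exit(EXIT_FAILURE, "EAL init failed\n");

  // Preallocated mbuf pool: buffers are recycled instead of allocated per
  // packet, which is the memory-management win described above.
  struct rte_mempool* pool = rte_pktmbuf_pool_create(
      "MBUF_POOL", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
  if (pool == NULL) rte_exit(EXIT_FAILURE, "mbuf pool creation failed\n");

  const uint16_t port = 0;  // port/queue setup omitted for brevity

  for (;;) {
    struct rte_mbuf* bufs[BURST_SIZE];
    // Batch receive: one call drains up to BURST_SIZE packets with no
    // kernel transition; mbufs reference NIC DMA memory (zero-copy).
    uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
    if (nb_rx == 0) continue;

    // ... user-space QUIC datagram processing would happen here ...

    // Batch transmit; free anything the NIC could not queue.
    uint16_t nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_rx);
    for (uint16_t i = nb_tx; i < nb_rx; i++) rte_pktmbuf_free(bufs[i]);
  }
  return 0;
}
```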

4. GPU-Accelerated Compression Implementation

4.1 Background and Motivation

In distributed systems, transmission speed and bandwidth utilization are critical to overall system performance. As models and data volumes grow, reducing the size of transmitted data while maintaining transmission efficiency becomes increasingly important. To address this, we introduced GPU-accelerated compression, leveraging the parallel processing power of GPUs to efficiently compress data before transmission, thereby reducing the bandwidth required.

4.2 GPU Acceleration Implementation

To achieve efficient GPU-accelerated compression, we undertook the following steps:

CUDA Kernel Design: We designed CUDA kernels for the selected compression algorithms, fully utilizing the parallel processing capabilities of the GPU. Each CUDA thread is responsible for processing a portion of the data block, enabling parallel compression of the data.
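The per-thread partitioning pattern looks roughly like the kernel below. The delta-encoding body is a deliberately simple stand-in for the actual compression algorithms; what carries over is the structure: each thread independently compresses one fixed-size chunk, so the whole buffer is processed in parallel. The chunk size is an illustrative tuning value.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

constexpr int kChunk = 256;  // bytes per thread; illustrative, not tuned

// Illustrative stand-in for a compression stage: each thread delta-encodes
// one fixed-size chunk independently, matching the per-thread partitioning
// described above. A production build would plug a real parallel codec
// into this same structure.
__global__ void DeltaEncodeChunks(const uint8_t* in, uint8_t* out, size_t n) {
  size_t begin = ((size_t)blockIdx.x * blockDim.x + threadIdx.x) * kChunk;
  if (begin >= n) return;
  size_t end = begin + kChunk < n ? begin + kChunk : n;

  uint8_t prev = 0;
  for (size_t i = begin; i < end; ++i) {
    out[i] = (uint8_t)(in[i] - prev);  // byte-wise delta within the chunk
    prev = in[i];
  }
}

// Host-side launch: one thread per kChunk-byte chunk.
// size_t chunks = (n + kChunk - 1) / kChunk;
// DeltaEncodeChunks<<<(chunks + 255) / 256, 256>>>(d_in, d_out, n);
```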

Memory Management: Due to differences in memory bandwidth and management strategies between GPUs and CPUs, we optimized memory allocation and data transfer in our implementation. We used CUDA streams to ensure smooth data transfer between CPU and GPU during compression, minimizing wait times and bandwidth bottlenecks.

Pipeline Processing of Compression and Transmission: To further enhance efficiency, we pipelined the compression and transmission processes. Data is compressed on the GPU before being transmitted via the QUIC protocol, and while one batch is in flight, the next batch is already being compressed. This pipelined approach maximizes the utilization of system computing resources and bandwidth.
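A minimal sketch of this double-buffered pipeline, using standard CUDA streams and asynchronous copies, might look as follows. `QuicSend` and `CompressKernel` are hypothetical placeholders for the transport hand-off and the compression kernel; error checking, cleanup, and variable compressed sizes are omitted for brevity.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical hand-off to the QUIC transport layer (stubbed).
void QuicSend(const uint8_t* /*buf*/, size_t /*len*/) {}

// Placeholder kernel; see the chunked compression kernel in Section 4.2.
__global__ void CompressKernel(const uint8_t* in, uint8_t* out, size_t n) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// Double-buffered pipeline: while batch i-1 is being sent over the network,
// batch i is copied to the GPU and compressed on its own CUDA stream, so
// PCIe transfers, GPU compute, and network I/O overlap.
void SendPipelined(const uint8_t* host_in, size_t batch, int nbatches) {
  if (nbatches <= 0) return;
  cudaStream_t streams[2];
  uint8_t *d_in[2], *d_out[2], *h_out[2];
  for (int s = 0; s < 2; ++s) {
    cudaStreamCreate(&streams[s]);
    cudaMalloc(&d_in[s], batch);
    cudaMalloc(&d_out[s], batch);
    cudaMallocHost(&h_out[s], batch);  // pinned memory enables async copies
  }

  for (int i = 0; i < nbatches; ++i) {
    int s = i % 2;  // alternate buffer/stream slots
    cudaMemcpyAsync(d_in[s], host_in + (size_t)i * batch, batch,
                    cudaMemcpyHostToDevice, streams[s]);
    CompressKernel<<<(unsigned)((batch + 255) / 256), 256, 0, streams[s]>>>(
        d_in[s], d_out[s], batch);
    cudaMemcpyAsync(h_out[s], d_out[s], batch,
                    cudaMemcpyDeviceToHost, streams[s]);
    if (i > 0) {
      int p = (i - 1) % 2;                // previous slot
      cudaStreamSynchronize(streams[p]);  // its compressed batch is ready
      QuicSend(h_out[p], batch);          // transmit while batch i compresses
    }
  }
  int last = (nbatches - 1) % 2;
  cudaStreamSynchronize(streams[last]);
  QuicSend(h_out[last], batch);
  // Cleanup (cudaFree / cudaFreeHost / cudaStreamDestroy) omitted.
}
```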

4.3 Performance Evaluation and Optimization

After integrating GPU-accelerated compression into the collective communications library, we conducted extensive performance testing, focusing on the following aspects:

Compression Speed and Ratio: We evaluated the compression speed and ratio of different algorithms under various datasets and communication scales. Results showed that GPU-accelerated compression significantly reduced data transmission volume in most cases while maintaining high compression speed.

GPU Utilization: By monitoring GPU utilization, we verified the parallel efficiency of each CUDA kernel and addressed inefficiencies by adjusting thread-block sizes and kernel parameters to improve parallelism.

Bandwidth Utilization: Compared to uncompressed data transmission, GPU-accelerated compression significantly improved bandwidth utilization, especially in high-bandwidth but high-latency network environments, leading to a noticeable increase in transmission efficiency.

Conclusion

By introducing GPU-accelerated compression, we further enhanced the overall performance of the collective communications library, particularly in large-scale data transmission scenarios, significantly reducing the volume of transmitted data and improving bandwidth utilization. Combined with the optimizations provided by QUIC, P2P, and DPDK, this implementation offers robust performance improvements for distributed computing systems. Experimental results demonstrate that GPU-accelerated compression provides effective compression and transmission efficiency across various scenarios, validating its practical application value in collective communications.