Cloud

Introduction

The Cloud is a computing power marketplace designed to gather idle GPU hosts from various locations and integrate them into a unified resource pool. It offers these resources to users through a flexible rental model, empowering AI developers with scalable computing power. Aligned with our mission to unlock and repurpose underutilized computing resources, the platform delivers robust GPU capabilities while streamlining the deployment and management of these assets for users.

Key Features

  1. Decentralized Resource Pooling: By deploying platform images, the platform can aggregate globally dispersed and idle GPU resources into a vast computing resource pool. This decentralized aggregation not only enhances resource utilization but also mitigates the risk of single points of failure.

  2. Dynamic Resource Allocation and Scheduling: The platform automatically performs dynamic allocation and scheduling of GPU resources based on user needs. It optimizes resource distribution by considering factors such as workload, priority, and geographical location of computing tasks, ensuring efficient task execution.

  3. User Interface and API Support: The platform offers an intuitive and user-friendly interface, making it easy for users to manage resources, monitor data, and operate the system. This reduces the barrier to entry for users and boosts productivity.

Deployment Architecture

The Cloud is divided into two parts:

  1. GPU Resource Collection Platform: This consists of the Agent, Access, and Master components. The Agent (deployed on GPU hosts) is responsible for data collection and reporting. The Access component handles data relay and security protection. The Master component is responsible for GPU identification and registration, as well as the aggregation and storage of various monitoring data.

  2. Management Platform: This platform is composed of resource management, monitoring system, console, and various operational systems. It primarily manages and optimizes the allocation, usage, monitoring, and scheduling of GPU resources, enhancing visibility into GPU resource management, optimizing resource utilization, ensuring system stability, and guaranteeing service quality (SLA) for users.

(Deployment architecture diagram)

GPU Resource Collection Module

The GPU resource collection module is primarily responsible for gathering relevant information about GPU hardware and its operational status. This data includes, but is not limited to, GPU model, memory capacity, core frequency, temperature, power consumption, and current load. Based on the collected information, the module identifies and classifies GPUs, enabling targeted resource management and scheduling. GPUs can be categorized into different classes or tiers based on model, performance parameters, and other characteristics.

Master

The Master component is responsible for automatic GPU identification, performance evaluation, and monitoring data collection. It records this data in a database, providing crucial information for the management platform and monitoring system.

  1. GPU Node Identification and Aggregation: When resource providers connect their GPU resources to the platform, the Master identifies and classifies these resources based on basic information reported by the Agent. This includes evaluating GPU performance metrics such as processing speed, memory, and network bandwidth, and categorizing them into different classes or tiers.

  2. GPU Monitoring Data Aggregation: The Master aggregates monitoring data reported by the Agent and stores it in a database, enabling real-time monitoring of GPU resource usage, including memory utilization, compute unit usage, power consumption, and more, to ensure efficient resource allocation and usage.

  3. Compatibility and Scalability: The Master supports GPUs from various vendors and models, ensuring broad applicability as the resource pool grows.

  4. Resource Management: It maintains the state machines of all Agents, determining the availability of GPU resources.

  5. Task Execution and Feedback: The Master communicates with Agents to issue specific commands, such as cluster creation or deletion, and tracks the execution status of these commands.

  6. API and Integration Support: Provides APIs or interfaces to integrate with other systems or modules, enabling more advanced functionality.
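
The Agent state tracking described in item 4 can be sketched as a minimal heartbeat-driven state machine. The states, timeout value, and field names below are assumptions made for the sketch, not the platform's actual implementation:

```python
from enum import Enum, auto

HEARTBEAT_TIMEOUT = 30.0  # assumed: seconds of silence before a host is marked offline

class AgentState(Enum):
    REGISTERED = auto()  # identified by the Master, not yet reporting
    ONLINE = auto()      # heartbeats arriving; GPUs available for allocation
    OFFLINE = auto()     # heartbeats missed; resources withdrawn from the pool

class AgentRecord:
    """Per-Agent record the Master might keep to track availability."""

    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.state = AgentState.REGISTERED
        self.last_seen = 0.0

    def on_heartbeat(self, now: float) -> None:
        # Any report from the Agent marks it online and refreshes the timestamp.
        self.last_seen = now
        self.state = AgentState.ONLINE

    def check_timeout(self, now: float) -> None:
        # Periodic sweep: demote Agents that have gone silent too long.
        if self.state is AgentState.ONLINE and now - self.last_seen > HEARTBEAT_TIMEOUT:
            self.state = AgentState.OFFLINE

agent = AgentRecord("host-01")
agent.on_heartbeat(now=100.0)
agent.check_timeout(now=105.0)  # within the timeout window: stays ONLINE
agent.check_timeout(now=200.0)  # 100 s silent: transitions to OFFLINE
```

A sweep like `check_timeout` would run on a timer over all registered Agents, so the scheduler never hands out GPUs on a host that has stopped reporting.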

Access

The Access component serves as the intermediary between the Agent and the Master, acting as the entry point for Agents into the system. Agents connect to the nearest Access module through intelligent DNS routing.

  1. Intelligent Routing and Selection: Based on the actual network environment of the GPU host and the user's geographic location, Access automatically selects the optimal connection point to minimize network latency and improve access speed.

  2. Load Balancing: Intelligent routing also enables load balancing across Access points, preventing server overload and enhancing the overall stability and reliability of the system.

  3. Failover: In the event of an Access point failure, the system automatically reroutes the Agent to another available service, ensuring the continuous reporting of Agent data.
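
The selection and failover behavior above can be sketched in a few lines: an Agent probes its candidate Access points and connects to the lowest-latency one that responds. The endpoint names and probe function below are purely illustrative:

```python
def pick_access_point(endpoints, probe):
    """Return the reachable endpoint with the lowest measured latency.

    `probe` returns a latency in milliseconds, or raises ConnectionError
    when the endpoint is unreachable.
    """
    measured = []
    for ep in endpoints:
        try:
            measured.append((probe(ep), ep))
        except ConnectionError:
            continue  # failover: skip unreachable Access points
    if not measured:
        raise RuntimeError("no Access point reachable")
    return min(measured)[1]

# Toy probe: pretend the first endpoint is down (hypothetical hostnames).
def fake_probe(ep):
    if ep == "access-eu.example.net":
        raise ConnectionError("unreachable")
    return {"access-us.example.net": 80.0, "access-ap.example.net": 120.0}[ep]

best = pick_access_point(
    ["access-eu.example.net", "access-us.example.net", "access-ap.example.net"],
    fake_probe,
)
# best is the lowest-latency reachable endpoint
```

In production this choice is made by intelligent DNS rather than client code, but the same principle applies: measure, prefer the closest healthy point, and re-run the selection when the current connection drops.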

Agent

The Agent is deployed on the GPU host and leverages system-level interfaces to gather detailed GPU information for reporting.

  1. Hardware Information Collection: The Agent collects and reports basic GPU information such as core count, manufacturer, and CUDA version. It communicates directly with the GPU hardware over interfaces such as PCIe and NVLink to obtain fundamental information and status data.

  2. Performance Monitoring Data Collection: It monitors and reports GPU performance metrics such as utilization and memory usage, integrating or invoking existing performance monitoring tools (e.g., NVIDIA’s nvidia-smi, AMD’s Radeon Software) to gather detailed performance data.

  3. Network Speed Testing: The Agent periodically tests and reports the upload and download speeds of the GPU host.

  4. GPU Availability Detection: The Agent uses APIs provided by GPU drivers to call relevant functions, obtaining detailed parameters and real-time status to assess GPU availability and compute performance, and then reports this information.

  5. Command Execution and Feedback: The Agent executes commands issued by the Master, such as creating or deleting clusters, and provides feedback on the results.
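
As a concrete sketch of item 2, a collector on an NVIDIA host could invoke nvidia-smi's query interface and parse its CSV output. The specific metrics queried and the field names in the returned dictionary are choices made for this sketch:

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw"

def parse_metrics_line(line: str) -> dict:
    """Parse one CSV row of `nvidia-smi --format=csv,noheader,nounits` output."""
    util, mem_used, mem_total, temp, power = [v.strip() for v in line.split(",")]
    return {
        "gpu_util_pct": int(util),
        "mem_used_mib": int(mem_used),
        "mem_total_mib": int(mem_total),
        "temp_c": int(temp),
        "power_w": float(power),
    }

def read_gpu_metrics() -> list:
    """Collect one sample per GPU; requires an NVIDIA driver on the host."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [parse_metrics_line(line) for line in out.strip().splitlines()]

# One row as nvidia-smi would emit it (values fabricated for illustration):
sample = parse_metrics_line("87, 20480, 24576, 63, 215.3")
```

The Agent would run a loop like this on an interval, attaching the host and GPU identifiers before reporting each sample upstream through Access.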

Management Platform Module

The management platform is primarily responsible for the efficient utilization and management of GPU resources, simplifying deployment and management processes, enhancing computational performance and efficiency, and improving the platform's security and reliability. It consists of two main components: the Resource Management Platform and the Monitoring System.

Resource Management

The resource management module handles the centralized management and scheduling of GPU resources, including their allocation, reclamation, and monitoring. This module ensures real-time visibility of GPU usage, enabling optimal resource allocation and utilization:

  1. GPU Resource Allocation and Scheduling: Dynamically allocates GPU resources to different users based on their needs and resource availability.

  2. Real-Time GPU Usage Management: Continuously monitors GPU resource usage, including memory consumption, compute unit utilization, and power consumption, ensuring that resources are allocated and used efficiently.

  3. GPU Virtualization and Isolation: Virtualizes and isolates GPU resources to allow multiple users or tasks to securely and efficiently share GPU resources.

  4. Task Scheduling and Load Balancing: Employs advanced scheduling algorithms, such as fair scheduling and priority-based scheduling, to dynamically allocate GPU resources according to task requirements and system status.

  5. Security and Monitoring: Provides robust security mechanisms, including identity authentication and access control, to ensure the secure use of GPU resources.
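
A priority-based scheduler of the kind mentioned in item 4 can be sketched as a greedy loop over a max-heap of pending tasks. The task fields and the simple "fits or defers" rule are simplifying assumptions, not the platform's actual algorithm:

```python
import heapq

def schedule(tasks, free_gpus):
    """Greedy sketch: place highest-priority tasks first; defer what doesn't fit.

    Each task is a dict with hypothetical fields: name, priority, gpus.
    """
    # heapq is a min-heap, so negate priority for max-first ordering;
    # the index breaks ties so dicts are never compared directly.
    heap = [(-t["priority"], i, t) for i, t in enumerate(tasks)]
    heapq.heapify(heap)
    placed, deferred = [], []
    while heap:
        _, _, task = heapq.heappop(heap)
        if task["gpus"] <= free_gpus:
            free_gpus -= task["gpus"]
            placed.append(task["name"])
        else:
            deferred.append(task["name"])
    return placed, deferred

tasks = [
    {"name": "train-a", "priority": 2, "gpus": 4},
    {"name": "infer-b", "priority": 5, "gpus": 2},
    {"name": "train-c", "priority": 1, "gpus": 8},
]
placed, deferred = schedule(tasks, free_gpus=8)
# infer-b (priority 5) and train-a fit; train-c is deferred
```

A real scheduler would also weigh fairness across users, geographic locality, and preemption, but the heap-ordered greedy pass is the common core of priority-based allocation.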

Monitoring System

The monitoring system focuses on enhancing visibility into GPU resource management, optimizing resource utilization, ensuring system stability, and guaranteeing user service quality:

  1. Real-Time Monitoring and Data Collection: Collects multi-dimensional data in real-time, monitoring various GPU metrics such as utilization, memory usage, temperature, power consumption, and the impact of job processes on GPU performance.

  2. Data Analysis: Analyzes the collected data to gain insights into GPU resource usage, identify performance bottlenecks, and detect potential issues.

  3. Fault Detection and Alerts: Detects abnormal GPU usage, such as excessively high utilization or other unusual metrics, and triggers alerts to enable prompt action to resolve issues.

  4. Multi-Dimensional Reporting: Generates reports across various dimensions based on customer needs, providing valuable data insights.
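
Threshold-based fault detection (item 3) can be sketched as a simple check over each collected sample. The metric names match the collection sketch above only by convention here, and the threshold values are illustrative, not the platform's actual limits:

```python
THRESHOLDS = {  # illustrative limits only
    "gpu_util_pct": 95,
    "temp_c": 85,
    "mem_used_pct": 90,
}

def check_alerts(sample: dict) -> list:
    """Return one alert message per metric that exceeds its threshold."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = sample.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric}={value} exceeds threshold {limit}")
    return alerts

alerts = check_alerts({"gpu_util_pct": 97, "temp_c": 70})
# only gpu_util_pct trips its threshold here
```

Production alerting usually adds debouncing (require N consecutive bad samples) and severity tiers, so a single noisy reading does not page anyone.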

User Interface and Interaction

The user interface and interaction module provides users with a friendly interface and a smooth interactive experience, making it easy to submit computing tasks, query results, and manage GPU resources.

  1. Intuitive Visualization: The user interface offers clear visual representations of GPU resource usage, performance charts, and more.

  2. One-Click Management: Provides an easy-to-use interface that allows users to manage GPU resources with a single click, submit computing tasks, check task statuses, and retrieve results.

  3. API Support: Offers a rich set of APIs, enabling users to access various reports and data efficiently.
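
As a sketch of what programmatic access might look like, the snippet below builds an authenticated report request using only the standard library. The base URL, endpoint path, and bearer-token scheme are placeholders for illustration, not the platform's documented API:

```python
import urllib.request

BASE_URL = "https://cloud.example.com/api/v1"  # placeholder host, not a real endpoint
API_TOKEN = "YOUR_API_TOKEN"                   # obtained from the console in practice

def build_report_request(path: str) -> urllib.request.Request:
    """Build an authenticated GET request for a report endpoint (illustrative)."""
    return urllib.request.Request(
        f"{BASE_URL}/{path}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
    )

req = build_report_request("reports/gpu-usage")
# urllib.request.urlopen(req) would send it; omitted here since the host is fictional
```

Whatever the real endpoints are, the pattern is the same one the console itself uses: authenticate, request a report dimension, and receive structured data suitable for dashboards or billing pipelines.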