In recent years, the construction of computing infrastructure across China has been in full swing. Beyond first-tier cities such as Beijing, Shanghai, and Shenzhen, many county-level regions are also accelerating their deployments. However, because of insufficient performance, an inability to meet the computing demands of scenarios in the AI large-model era, and a mismatch between data centers and local industry needs, the overall utilization rate of computing power centers is only about 50%, leaving a large share of capacity idle. Measuring computing performance solely by computing power scale and cluster size is clearly no longer suited to the rapid development of the digital economy.
Beijing Electronics Numerical Intelligence Technology Co., Ltd. (hereinafter "Beidian Shuzhi") has proposed the concept of an "optimal solution for computing power" and, based on the development of the artificial intelligence industry and the iteration of computing power demand, has further refined the evaluation criteria for computing power. Beidian Shuzhi believes the "optimal solution for computing power" requires "three plus one guarantee": accelerating single-chip computing power, strengthening heterogeneous cluster performance, and improving communication capability, while guaranteeing the safe and stable operation of intelligent computing cluster training.
Beidian Shuzhi is an artificial intelligence technology enterprise focused on original, disruptive, and leading technological innovation. It has built a full-stack product and solution portfolio spanning computing power, algorithms, and data, and won the "AI Computing Power Layer Innovation Enterprise Award" in May 2024.
Accelerating single-chip computing power to make domestic chips truly "usable"
At present, the raw computing performance of domestic GPUs is not low, but most customers report that domestic chips are still not "user-friendly" enough. This is because the GPU products currently produced and deployed in China were largely designed for the previous generation of algorithms, and further work is needed for them to meet the operator requirements of AI large models. Single-chip computing power therefore needs to be accelerated through software such as rich operator libraries and compilers. Beidian Shuzhi's Forward AI Heterogeneous Computing Platform provides multiple optimizations that accelerate single-chip computing power and improve the adaptability of domestic computing chips, including model-level optimizations such as quantization acceleration, hyperparameter optimization, and sparse inference, as well as compilation optimizations such as operator fusion, computation graph optimization, and hardware memory access optimization.
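The platform's own APIs are not public, so as a rough illustration of the two kinds of optimization described above (model-level quantization and compile-level graph optimization), the following minimal sketch uses generic PyTorch APIs; it assumes PyTorch 2.x and is not the platform's implementation.

```python
# Minimal sketch (assumes PyTorch >= 2.0): generic examples of model-level
# quantization and compile-level graph optimization, not the platform's API.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).eval()

# Model-level optimization: dynamic INT8 quantization of the Linear layers,
# shrinking weights and speeding up matrix multiplication on supported hardware.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Compile-level optimization: torch.compile captures the computation graph and
# applies operator fusion and memory-access optimizations for the target backend.
compiled = torch.compile(model)

with torch.no_grad():
    print(quantized(torch.randn(8, 1024)).shape)  # torch.Size([8, 1024])
    print(compiled(torch.randn(8, 1024)).shape)   # torch.Size([8, 1024])
```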
Multi-chip hybrid heterogeneity may become mainstream, letting suitable chips do the right jobs
At present, intelligent computing centers are mostly supplied by a single chip vendor, so shortfalls in computing power supply are hard to avoid. Hybrid heterogeneous technology can work around the limited production capacity of any single chip vendor while offering a more cost-effective computing solution. Because of their different architectural designs, different chips are naturally suited to different training and inference tasks; if the appropriate chips can be assigned to the corresponding tasks, the cost-effectiveness of the overall computing solution improves significantly. However, heterogeneous pooled training can run into problems such as accuracy errors and synchronization issues. Under uneven computing power, the system needs to perform uniform or non-uniform task partitioning based on model characteristics, real-time load status, and the hardware characteristics of the cluster, as sketched below.
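As a hedged illustration of non-uniform partitioning under uneven computing power, the sketch below splits a batch of work across chips in proportion to a measured per-chip throughput. The chip names and throughput figures are placeholders, not benchmarks, and a real scheduler would also weigh model structure and live load.

```python
# Minimal sketch of non-uniform task partitioning: split a workload across
# heterogeneous chips in proportion to their measured throughput.
# Chip names and throughput figures below are illustrative placeholders.

def partition_batch(batch_size: int, throughput: dict) -> dict:
    """Assign each chip a share of the batch proportional to its throughput."""
    total = sum(throughput.values())
    shares = {chip: int(batch_size * tput / total) for chip, tput in throughput.items()}
    # Hand any rounding remainder to the fastest chip.
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += batch_size - sum(shares.values())
    return shares

if __name__ == "__main__":
    measured = {"chip_a": 310.0, "chip_b": 190.0, "chip_c": 120.0}  # samples/sec (placeholder)
    print(partition_batch(1024, measured))
    # {'chip_a': 513, 'chip_b': 313, 'chip_c': 198}
```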
The Forward AI Heterogeneous Computing Platform can optimize model performance through operator-level model splitting; through hardware-aware automatic tuning based on automated machine learning, it adjusts model configurations and parameters to find the best performance and accuracy on a specific chip; and its framework supports distributing AI large models across multiple GPUs for computation, improving the efficiency of model training and inference and ensuring that each chip takes on tasks matched to its computing power.
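The platform performs this splitting automatically at the operator level; the manual, layer-level placement below is only a minimal sketch of the underlying idea of distributing a model across cards, and it assumes two CUDA-visible GPUs.

```python
# Minimal sketch of layer-level model splitting across two GPUs. The real
# platform splits at the operator level automatically; this manual placement
# only illustrates the idea and assumes two CUDA-visible devices.
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # The heavier front half goes on one card, the lighter back half on the
        # other; in practice the split ratio would be tuned to each chip.
        self.front = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.back = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.front(x.to("cuda:0"))
        return self.back(x.to("cuda:1"))  # move activations between devices

model = SplitModel()
out = model(torch.randn(8, 1024))
print(out.device)  # cuda:1
```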
Bridging collective communication libraries to solve communication problems and improve AI large-model training performance
In the era of ten-thousand-card ("wanka") clusters, communication capability directly constrains data transmission efficiency during AI large-model training, so an efficient, stable, low-latency network is essential to the construction and operation of intelligent computing centers. At the hardware level, interconnects such as NVLink and HCCS can effectively improve card-to-card communication; at the protocol level, RDMA reduces end-to-end communication latency across machines, raises inter-node transmission rates, and effectively improves the communication efficiency of intelligent computing clusters. At the software level, collective communication libraries control data communication between GPUs and between servers, and the differences among the communication libraries of heterogeneous cards create communication challenges. Beidian Shuzhi addresses communication between different GPU chips by bridging the collective communication libraries of various vendors, deeply adapting and optimizing those libraries, and ensuring information exchange within heterogeneous clusters through standardized distributed communication interfaces; it also overlaps the computation and communication processes through strategies such as time overlap, reducing the impact of communication latency on overall training performance.
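To make the "overlap computation with communication" strategy concrete, the sketch below uses generic torch.distributed calls with an NCCL-style backend to launch an asynchronous all-reduce and run independent computation while the collective is in flight. It is an illustration only, not Beidian Shuzhi's bridged communication layer; on domestic cards the backend and tuning would differ.

```python
# Minimal sketch of overlapping computation with collective communication.
# Generic torch.distributed usage; the actual platform bridges vendor-specific
# collective communication libraries behind a standardized interface.
import os
import torch
import torch.distributed as dist

def main():
    # Assumes launch via `torchrun --nproc_per_node=N this_file.py`,
    # which sets RANK / WORLD_SIZE / LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))

    grads = torch.randn(1 << 20, device=device)                     # pretend gradient bucket
    work = dist.all_reduce(grads, op=dist.ReduceOp.SUM, async_op=True)

    # While the all-reduce is in flight, run computation that does not
    # depend on the reduced gradients (e.g. the next micro-batch's forward).
    x = torch.randn(4096, 4096, device=device)
    y = x @ x

    work.wait()                      # synchronize before using the reduced result
    grads /= dist.get_world_size()   # average the gradients
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```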
Broad chip management to ensure stable operation of the computing power cluster
A ten-thousand-card cluster contains many types and large quantities of hardware, each component with its own failure rate, and any hardware failure affects the training of the whole intelligent computing cluster. Achieving optimal computing power therefore requires an efficient and reliable intelligent cloud management platform that provides real-time monitoring, minute-level software and hardware fault localization, and automatic fault detection and repair. The Forward AI Heterogeneous Computing Platform supports broad management of multiple domestic chips, helping users manage AI accelerator cards of different brands and types in a unified way and ensuring seamless integration and optimized utilization of various AI chips. This broad management capability also lets users flexibly adjust resource allocation to specific needs and optimize the computing power supply for different training and inference tasks.
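As a hedged sketch of the kind of real-time telemetry such a management platform collects, the loop below polls per-card temperature, utilization, and memory with the NVML Python bindings and flags cards that look unhealthy. It covers only NVIDIA-style NVML telemetry; domestic accelerators expose their own management interfaces, and the thresholds are illustrative rather than the platform's actual fault-detection policy.

```python
# Minimal monitoring sketch using NVIDIA's NVML bindings (pip install nvidia-ml-py).
# Domestic accelerators expose their own telemetry APIs; the thresholds below
# are illustrative, not the platform's actual fault-detection policy.
import time
import pynvml

TEMP_LIMIT_C = 85          # illustrative thermal threshold
MEM_LIMIT_RATIO = 0.95     # illustrative memory-pressure threshold

def poll_once():
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        if temp > TEMP_LIMIT_C or mem.used / mem.total > MEM_LIMIT_RATIO:
            # A real platform would trigger minute-level fault localization
            # and automatic repair or task migration at this point.
            print(f"[ALERT] gpu{i}: temp={temp}C util={util.gpu}% "
                  f"mem={mem.used / mem.total:.0%}")

if __name__ == "__main__":
    pynvml.nvmlInit()
    try:
        while True:
            poll_once()
            time.sleep(10)   # polling interval (seconds)
    finally:
        pynvml.nvmlShutdown()
```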
The "Three Plus One Guarantee" is the optimal solution for computing power proposed by North Electric Digital Intelligence in the current context of AI model penetration into various industries. It not only optimizes the allocation of computing power and improves the utilization of computing resources, but also provides a path for enterprises to move towards intelligence and AI. It is worth mentioning that on August 21, 2024, the "Forward AI Heterogeneous Computing Platform" was also selected as one of the first "Artificial Intelligence+" application scenario cases in Beijing, marking a solid step forward in the project's application landing. In the future, Beidian Shuzhi will continue to provide low-cost, high-performance, and stable computing power supply for various industries, contributing to the construction of Digital China.