inference speed is 10 times that of NVIDIA GPUs, while power consumption is only one-tenth.
According to reports, Groq's large-model inference chip is billed as the world's first LPU (Language Processing Unit) solution. It is a Tensor Streaming Processor (TSP) chip built on Groq's new TSA architecture and designed to improve the performance of compute-intensive workloads such as machine learning and artificial intelligence.
Although Groq's LPU does not use an expensive cutting-edge process node and is instead built on a 14nm process, the self-developed TSA architecture gives the chip a high degree of parallelism, allowing it to process millions of data streams simultaneously. The chip also integrates 230MB of SRAM in place of DRAM to guarantee memory bandwidth, with on-chip memory bandwidth as high as 80TB/s.
According to official data, Groq's LPU chip delivers up to 1,000 TOPS (tera operations per second) of compute, and on some machine learning models its performance is 10 to 100 times better than that of conventional GPUs and TPUs.
Groq stated that cloud servers based on its LPU chip running the Llama 2 or Mistral models far exceed the compute and response speed of ChatGPT running on NVIDIA AI GPUs, generating up to 500 tokens per second. In comparison, the current public version of ChatGPT-3.5 generates only about 40 tokens per second. Since ChatGPT-3.5 runs mainly on NVIDIA GPUs, this means the response speed of the Groq LPU chip is more than 10 times that of an NVIDIA GPU. Groq also said that, compared with the large-model inference performance offered by other cloud platform vendors, its LPU-based cloud servers were ultimately up to 18 times faster.
In addition, in terms of energy consumption, an NVIDIA GPU needs about 10 to 30 joules to generate each token in a response, while the Groq LPU chip needs only 1 to 3 joules. In other words, while inference speed increases by a factor of 10, energy cost drops to one-tenth that of an NVIDIA GPU, which Groq equates to a 100-fold improvement in cost performance.
The company demonstrated the chip's performance, showing support for Mistral AI's Mixtral 8x7B SMoE model as well as Meta's Llama 2 7B and 70B models, among others. The demo supports a context length of 4,096 and can be tried directly online. Beyond that, Groq has publicly called out major players, vowing to surpass NVIDIA within three years. The company's LPU inference chips currently sell for more than $20,000 on third-party websites, lower than the $25,000-$30,000 price of an NVIDIA H100.
According to available information, Groq is an artificial intelligence hardware startup founded in 2016, with a core team drawn from Google's original Tensor Processing Unit (TPU) engineering team. Groq founder and CEO Jonathan Ross was a core developer of Google's TPU project, and Jim Miller, the company's vice president of hardware engineering, previously led the design of computing hardware for Amazon's AWS cloud service and led all Pentium II projects at Intel. The company has raised over $62 million to date.
Why use large-capacity SRAM?
The Groq LPU chip uses a temporal instruction set computer architecture that is very different from those of most other startups and existing AI processors. It is designed as a powerful single-threaded stream processor with a specially designed instruction set that leverages tensor operations and tensor data movement to let machine learning models run more efficiently. Unique to this architecture is the direct interaction between execution units, on-chip SRAM, and other execution units, so it does not need to load data from memory as frequently as a GPU relying on HBM (High Bandwidth Memory).
The magic of Groq lies not only in the hardware but also in the software; software-defined hardware plays an important role here. Groq's software compiles TensorFlow models and other deep learning models into independent instruction streams that are fully orchestrated in advance. The orchestration comes from the compiler, which determines and plans the entire execution ahead of time, resulting in highly deterministic computation. "This determinism comes from the fact that our compiler statically schedules all instruction units. This allows us to expose instruction-level parallelism without making any aggressive guesses. There are no branch target buffers or caching agents on the chip," explains Groq Chief Architect Dennis Abts. To maximize performance, the Groq LPU chip adds more SRAM and execution blocks.
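To make the idea of static scheduling concrete, here is a minimal, hypothetical Python sketch (not Groq's actual compiler or instruction set): when every operation's latency is known at compile time, the compiler can assign each instruction a fixed start cycle, so execution is deterministic and needs no branch prediction or caching at run time.

```python
# Illustrative sketch only: a toy static scheduler showing the general idea of
# compile-time instruction scheduling (this is NOT Groq's compiler or ISA).
# Because every op's latency is known up front, every start cycle can be fixed
# before execution, making the run fully deterministic.

def static_schedule(ops):
    """ops: dict name -> (latency_cycles, [dependency names]), in topological order.
    Returns dict name -> fixed start cycle."""
    start, finish = {}, {}
    for name, (latency, deps) in ops.items():
        # An op may start only after all of its producers have finished.
        ready = max((finish[d] for d in deps), default=0)
        start[name] = ready
        finish[name] = ready + latency
    return start

# Hypothetical three-op tensor pipeline with made-up latencies.
program = {
    "load_weights": (4, []),
    "matmul":       (8, ["load_weights"]),
    "activation":   (2, ["matmul"]),
}
print(static_schedule(program))
# {'load_weights': 0, 'matmul': 4, 'activation': 12}
```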
sram is "static random-access memory" (static random-access memory), which is a type of random-access memory. The so-called "static" means that as long as the memory is powered on, the data stored in it can be maintained constantly. In contrast, the data stored in dynamic random access memory (DRAM) needs to be updated periodically. For more than 60 years since SRAM was introduced, it has been the memory of choice for low-latency and high-reliability applications.
In fact, SRAM has clear advantages for AI/ML applications. SRAM is crucial for AI, especially embedded SRAM: it is the highest-performance memory and can be integrated directly with high-density logic cores. Many CPUs already integrate SRAM on-chip, close to the compute units, as cache, allowing the CPU to fetch important data directly and quickly from SRAM without reading it from DRAM. However, the SRAM capacity of current flagship CPUs is at most a few dozen MB.
The main reasons Groq chose large-capacity SRAM in place of DRAM are as follows:
1. The access speed of SRAM is much faster than that of DRAM, which means the LPU chip can process data more quickly, improving compute performance.
2. SRAM does not suffer the refresh delays of DRAM, which means the LPU chip can process data more efficiently and with lower latency.
3. SRAM consumes less power than DRAM, which means the LPU chip can manage energy consumption more effectively and improve efficiency.
However, SRAM also has some disadvantages:
1. Larger area: While logic transistors continue to shrink with each CMOS process node, SRAM scales down only with great difficulty. In fact, as early as the 20nm era, SRAM cells stopped shrinking in step with logic transistors.
2. Small capacity: SRAM capacity is much smaller than DRAM capacity because each bit requires more transistors to store, and SRAM cells are hard to shrink, so for the same die area SRAM holds far less data than DRAM and other memories. This limits SRAM's use when large amounts of data must be stored.
3. High cost: SRAM is much more expensive than DRAM; at the same capacity, SRAM needs more transistors to store the data, which drives up its cost.
In general, although SRAM has disadvantages in area, capacity, and cost that limit its use in some applications, its access speed is much faster than DRAM's, which makes it perform very well in compute-intensive applications. The large-capacity SRAM in the Groq LPU chip provides higher bandwidth (up to 80TB/s), lower power consumption, and lower latency, improving efficiency for compute-intensive workloads such as machine learning and artificial intelligence.
So, compared with the HBM used by current AI GPUs, what are the advantages and disadvantages of the SRAM integrated into the Groq LPU chip?
The Groq LPU chip's SRAM capacity is only 230MB, whereas the HBM capacity of an AI GPU is usually tens of GB (the NVIDIA H100, for example, integrates 80GB of HBM), which means a single LPU chip may not be able to handle larger datasets and more complex models. At the same capacity, SRAM also costs more than HBM. However, compared with HBM, the SRAM integrated into the Groq LPU chip still offers higher bandwidth (the H100's HBM bandwidth is only about 3TB/s), lower power consumption, and lower latency.
Can it replace the NVIDIA H100?
Although the data released by Groq suggests that its LPU chip's inference speed is more than 10 times that of an NVIDIA GPU, with an energy cost of only one-tenth (equivalent to a 100-fold improvement in cost performance), Groq did not clearly state which NVIDIA GPU product it was comparing against. Since NVIDIA's most mainstream AI GPU is currently the H100, we will compare the NVIDIA H100 with the Groq LPU.
Since the Groq LPU has only 230MB of on-chip SRAM as memory, running the Llama 2 70B model, even quantized to INT8 precision, still requires about 70GB of memory. Counting the model weights alone, roughly 305 Groq LPU accelerator cards are needed; once other memory consumption is taken into account, around 572 cards may be required. Official data puts the average power consumption of a Groq LPU at 185W, so even without counting peripheral devices, 572 Groq LPU accelerator cards draw as much as 105.8kW. Assuming a Groq LPU accelerator card costs US$20,000, purchasing 572 cards would cost as much as US$11.44 million (volume pricing should be lower).
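As a sanity check, the arithmetic behind those figures can be reproduced in a few lines (a rough sketch: the 70GB, 185W, and $20,000 figures come from the article, and the jump from 305 to 572 cards is the article's own allowance for memory beyond the weights).

```python
# Back-of-the-envelope check of the card-count, power, and hardware-cost
# figures cited above (article figures; deployment overheads are rough
# assumptions, not measurements).
weights_gb        = 70      # Llama 2 70B at INT8, roughly 1 byte per parameter
sram_per_lpu_gb   = 0.230   # 230MB of on-chip SRAM per Groq LPU
lpu_power_w       = 185     # average power per LPU card (official figure)
lpu_price_usd     = 20_000  # approximate third-party price per card

cards_weights_only  = -(-weights_gb // sram_per_lpu_gb)  # ceiling division
cards_with_overhead = 572   # article's estimate once other memory use is included

print(f"cards for weights alone : {cards_weights_only:.0f}")                                  # ~305
print(f"total power @572 cards  : {cards_with_overhead * lpu_power_w / 1000:.1f} kW")         # ~105.8 kW
print(f"hardware cost @572 cards: ${cards_with_overhead * lpu_price_usd / 1e6:.2f} M")        # ~$11.44 M
```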
According to data shared by AI scientist Jia Yangqing, data centers currently average about US$200 per kilowatt per month, which means the annual electricity bill for 572 Groq LPU accelerator cards is roughly 105.8 × 200 × 12 ≈ US$254,000.
Jia Yangqing also noted that 4 NVIDIA H100 accelerator cards can deliver half the performance of 572 Groq LPUs, meaning an 8-card H100 server is roughly equivalent to 572 Groq LPUs. The nominal maximum power of 8 H100 cards is 10kW (in practice about 8-9kW), so the annual electricity bill is only about US$24,000 or slightly less. A server with eight H100 accelerator cards currently costs about US$300,000.
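For reference, here is the same electricity arithmetic in code form (a sketch using the article's figures and nominal power rather than measured draw).

```python
# Rough annual electricity comparison using the figures above
# ($200 per kW per month, from Jia Yangqing's estimate cited in the article).
kw_month_usd = 200

groq_kw = 105.8   # 572 LPU cards at 185W each
h100_kw = 10.0    # nominal maximum for an 8-card H100 server

print(f"Groq 572-LPU cluster : ${groq_kw * kw_month_usd * 12:,.0f} / year")  # ~$254,000
print(f"8x H100 server       : ${h100_kw * kw_month_usd * 12:,.0f} / year")  # ~$24,000
```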
Obviously, when running the Llama 2 70B model at the same INT8 precision, the actual price/performance of the NVIDIA H100 is much higher than that of the Groq LPU.
Even if we instead compare with the Llama 2 7B model at FP16 precision, which needs a minimum of about 14GB of memory, it would take about 70 Groq LPU accelerator cards to deploy. At 188 TFLOPS of FP16 compute per card, the total compute would reach about 13.2 PFLOPS, which is wasteful if used only to serve the Llama 2 7B model. In comparison, a single NVIDIA H100 accelerator card with its 80GB of HBM can host 5 FP16 Llama 2 7B models, and the H100 offers about 2 PFLOPS of FP16 compute. Even to match the total compute of 70 Groq LPU cards, a single 8-card NVIDIA H100 server is enough.
On hardware cost alone, 70 Groq LPU accelerator cards cost about US$1.4 million, while a server with 8 H100 cards costs about US$300,000. Obviously, for running the Llama 2 7B model at FP16 precision, the cost performance of the NVIDIA H100 is again much higher than that of the Groq LPU.
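The same kind of back-of-the-envelope check works for the 7B FP16 case (a sketch using the article's figures; the article rounds the card count up to about 70 to allow for memory beyond the weights, and activation/KV-cache memory is otherwise ignored here).

```python
# Rough check of the Llama 2 7B FP16 comparison above (article figures).
model_fp16_gb    = 14      # ~7B params x 2 bytes
lpu_sram_gb      = 0.230   # 230MB per LPU
lpu_fp16_tflops  = 188
h100_hbm_gb      = 80
h100_fp16_pflops = 2       # article's approximate figure

lpus_weights_only = -(-model_fp16_gb // lpu_sram_gb)   # ceiling division -> ~61 (article uses ~70)
print(f"LPU cards (weights only)  : {lpus_weights_only:.0f}")
print(f"70-LPU FP16 compute       : {70 * lpu_fp16_tflops / 1000:.1f} PFLOPS")   # ~13.2
print(f"7B FP16 copies per H100   : {h100_hbm_gb // model_fp16_gb}")             # 5
print(f"8x H100 FP16 compute      : {8 * h100_fp16_pflops} PFLOPS")              # 16
print(f"70 LPUs hardware cost     : ${70 * 20_000:,}")                           # $1,400,000
```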
Of course, this is not to say that the Groq LPU has no advantages over the NVIDIA H100. As mentioned earlier, its main advantage is the large-capacity SRAM with ultra-high memory bandwidth of 80TB/s, which makes it well suited to scenarios where smaller models need frequent access to data in memory. The disadvantage is that SRAM capacity is small, so running large models requires many more Groq LPUs. Could Groq further increase the LPU's SRAM capacity to make up for this shortcoming? Yes, in principle, but that would significantly increase the chip's area and cost and raise power consumption issues. Perhaps in the future Groq will consider adding HBM or DRAM to improve the LPU's adaptability.
Editor: Xinzhixun-Rurounijian