Large language models offer incredible new capabilities, expanding the frontier of what is possible with AI. But their large size and unique execution characteristics can make them difficult to use in cost-effective ways.
NVIDIA has been working closely with leading companies, including Meta, Anyscale, Cohere, Deci, Grammarly, Mistral AI, MosaicML, now a part of Databricks, OctoML, Tabnine, and Together AI, to accelerate and optimise LLM inference.
Those innovations have been integrated into the open-source NVIDIA TensorRT-LLM software, available for Ampere, Lovelace, and Hopper GPUs and set for release in the coming weeks.
TensorRT-LLM consists of the TensorRT deep learning compiler and includes optimised kernels, pre- and post-processing steps, and multi-GPU/multi-node communication primitives for groundbreaking performance on NVIDIA GPUs. It enables developers to experiment with new LLMs, offering peak performance and quick customisation capabilities, without requiring deep knowledge of C++ or NVIDIA CUDA.
TensorRT-LLM improves ease of use and extensibility through an open-source, modular Python API for defining, optimising, and executing new architectures and enhancements as LLMs evolve, so models can be customised easily.
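As a rough illustration of that workflow, the sketch below loads a Hugging Face checkpoint, builds an optimised engine, and generates text through the library's high-level Python interface. The class names, arguments, and model identifier are assumptions about the API and may differ between releases.

```python
from tensorrt_llm import LLM, SamplingParams  # high-level Python API (assumed)

# Build an optimised engine from a Hugging Face checkpoint; the model
# name here is purely illustrative.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

sampling = SamplingParams(max_tokens=64, temperature=0.8)
outputs = llm.generate(["Summarise this article in two sentences: ..."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```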
For example, MosaicML has seamlessly added the specific features it needs on top of TensorRT-LLM and integrated them into its inference serving. Naveen Rao, vice president of engineering at Databricks, notes that “it has been an absolute breeze.”
“TensorRT-LLM is easy to use, feature-packed with streaming of tokens, in-flight batching, paged-attention, quantisation, and more, and is efficient,” Rao said. “It delivers state-of-the-art performance for LLM serving using NVIDIA GPUs and allows us to pass on the cost savings to our customers.”
Performance Comparison
Summarising articles is just one of the many applications of LLMs. The following benchmarks show performance improvements brought by TensorRT-LLM on the latest NVIDIA Hopper architecture.
The following figures reflect article summarisation using an NVIDIA A100 and NVIDIA H100 with CNN/Daily Mail, a well-known dataset for evaluating summarisation performance.
In Figure 1, the H100 alone is 4x faster than the A100. Adding TensorRT-LLM and its benefits, including in-flight batching, results in an 8x total increase to deliver the highest throughput.
On Llama 2—a popular LLM released recently by Meta and used widely by organisations looking to incorporate generative AI—TensorRT-LLM can accelerate inference performance by 4.6x compared to A100 GPUs.
TCO and Energy Efficiency Improvements
Minimising total cost of ownership (TCO) and energy consumption in the data centre are key goals for customers adopting AI, and LLMs in particular, given the explosive increase in their computational requirements. Customers do not just look at the cost of a single server when it comes to their AI platform expenditures. Rather, they have to look at aggregate capital and operational costs.
Capital costs include the cost of GPU servers, management head nodes (CPU servers that coordinate all the GPU servers), networking equipment (fabric, Ethernet, and cabling), and storage. Operational costs include data centre IT staff and software, equipment maintenance, and data centre rent and electricity. Viewed holistically in terms of the actual costs a data centre incurs, significant performance speedups reduce the equipment and maintenance requirements, leading to sizable capital and operational expense savings.
The chart below demonstrates that an 8x performance speedup on small language models like GPT-J 6B leads to a 5.3x reduction in TCO and a 5.6x reduction in energy (electricity bill savings) over the A100 baseline.
Similarly, on state-of-the-art LLMs like Llama 2, even with 70 billion parameters, customers can realise a 4.6x performance speedup, which results in a 3x reduction in TCO and a 3.2x reduction in energy consumed versus the A100 baseline.
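To make the link between speedup and TCO concrete, the back-of-the-envelope sketch below shows how a per-server throughput gain shrinks the number of servers needed for a fixed workload, and with it the aggregate cost. Every figure and ratio in it is a hypothetical placeholder, not NVIDIA's data.

```python
# Illustrative-only arithmetic: how a throughput speedup can translate into a
# TCO reduction. All cost figures below are hypothetical placeholders.

baseline_servers = 8                 # A100-class servers sized for a target load
speedup = 4.6                        # measured per-server throughput gain
servers_needed = baseline_servers / speedup

# Hypothetical per-server costs over the same operating period.
capex_per_server = 200_000           # server plus its share of network/storage
opex_per_server = 60_000             # electricity, cooling, staff, maintenance
capex_ratio, opex_ratio = 1.4, 1.3   # assume the faster server costs/draws more

baseline_tco = baseline_servers * (capex_per_server + opex_per_server)
new_tco = servers_needed * (capex_per_server * capex_ratio
                            + opex_per_server * opex_ratio)
print(f"TCO reduction: {baseline_tco / new_tco:.1f}x")
```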
In addition to the TCO savings shown above, there are substantial labour costs associated with software development, which can easily exceed infrastructure costs themselves. Investments made by NVIDIA in TensorRT, TensorRT-LLM, the Triton Inference Server, and the NeMo framework save developers a great deal of time and reduce time to market. Customers need to factor in these labour costs to develop a true picture of their aggregate AI expenditures.
LLM Ecosystem Explosion
The ecosystem is innovating rapidly, developing new and diverse model architectures. Larger models unleash new capabilities and use cases. Some of the largest, most advanced language models, like Meta’s 70-billion-parameter Llama 2, require multiple GPUs working in concert to deliver responses in real time. Previously, developers looking to achieve the best performance for LLM inference had to rewrite and manually split the AI model into fragments and coordinate execution across GPUs.
TensorRT-LLM uses tensor parallelism, a type of model parallelism in which individual weight matrices are split across devices. This enables efficient inference at scale, with each model running in parallel across multiple GPUs connected through NVLink and across multiple servers, without developer intervention or model changes.
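The toy NumPy example below illustrates the idea: a single weight matrix is split column-wise across two hypothetical devices, each computes a partial product, and the partial outputs are gathered. It is a conceptual sketch, not TensorRT-LLM's implementation, which distributes the work across real GPUs over NVLink.

```python
import numpy as np

# Toy column-parallel matrix multiply: shard the weight matrix across two
# "devices", compute partial results, then gather them.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))        # activations for a small batch
W = rng.standard_normal((512, 1024))     # full weight matrix

W0, W1 = np.split(W, 2, axis=1)          # column shards, one per device
y0 = x @ W0                              # would run on device 0
y1 = x @ W1                              # would run on device 1
y = np.concatenate([y0, y1], axis=1)     # all-gather of the partial outputs

assert np.allclose(y, x @ W)             # identical to the unsharded result
```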
As new models and model architectures are introduced, developers can optimise their models with the latest NVIDIA AI kernels available open source in TensorRT-LLM. The supported kernel fusions include cutting-edge implementations of FlashAttention and masked multi-head attention for the context and generation phases of GPT model execution, along with many others.
Additionally, TensorRT-LLM includes fully optimised, ready-to-run versions of many LLMs widely used in production today. This includes Meta Llama 2, OpenAI GPT-2 and GPT-3, Falcon, Mosaic MPT, BLOOM, and a dozen others, all of which can be implemented with the simple-to-use TensorRT-LLM Python API.
These capabilities help developers create customised LLMs faster and more accurately to meet the needs of virtually any industry.
In-Flight Batching
Today’s large language models are extremely versatile. A single model can be used simultaneously for a variety of tasks that look very different from one another. From a simple question-and-answer response in a chatbot to the summarisation of a document or the generation of a long chunk of code, workloads are highly dynamic, with outputs varying in size by several orders of magnitude.
This versatility can make it difficult to batch requests and execute them in parallel effectively, a common optimisation for serving neural networks, because requests with short outputs finish much earlier than those with long ones, leaving part of the batch idle.
To manage these dynamic loads, TensorRT-LLM includes an optimised scheduling technique called in-flight batching. This takes advantage of the fact that the overall text generation process for an LLM can be broken down into multiple iterations of execution on the model.
With in-flight batching, rather than waiting for the whole batch to finish before moving on to the next set of requests, the TensorRT-LLM runtime immediately evicts finished sequences from the batch. It then begins executing new requests while other requests are still in flight. In-flight batching and the additional kernel-level optimisations improve GPU utilisation and at least double the throughput on a benchmark of real-world LLM requests on H100 Tensor Core GPUs, helping reduce energy costs and minimise TCO.
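The toy scheduler loop below sketches that behaviour: after every generation iteration, finished sequences are evicted and queued requests immediately fill the freed slots. The helper callables are placeholders; this is a conceptual illustration, not the TensorRT-LLM runtime's actual scheduler.

```python
from collections import deque

def in_flight_batching(requests, max_batch, step, is_finished):
    """Toy in-flight (continuous) batching loop.

    `step` runs one generation iteration for the active batch and
    `is_finished` reports whether a sequence has emitted its end token;
    both stand in for the real runtime's logic.
    """
    queue, active, done = deque(requests), [], []
    while queue or active:
        # Refill free batch slots immediately instead of waiting for the
        # whole batch to drain.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())

        step(active)                      # one decoding iteration for all

        still_running = []
        for seq in active:
            (done if is_finished(seq) else still_running).append(seq)
        active = still_running            # finished sequences are evicted
    return done
```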
H100 Transformer Engine with FP8
LLMs contain billions of model weights and activations, typically trained and represented with 16-bit floating point (FP16 or BF16) values where each value occupies 16 bits of memory. At inference time, however, most models can be effectively represented at lower precision, like 8-bit or even 4-bit integers (INT8 or INT4), using modern quantisation techniques.
Quantisation is the process of reducing the precision of a model’s weights and activations without sacrificing accuracy. Using lower precision means that each parameter is smaller, and the model takes up less space in GPU memory. This enables inference on larger models with the same hardware while spending less time on memory operations during execution.
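For a sense of scale, the quick calculation below (weights only, ignoring the KV cache and activations) shows how precision drives the memory footprint of a 70-billion-parameter model; the numbers are plain arithmetic, not measured figures.

```python
# Weight-memory footprint of a 70B-parameter model at different precisions.
params = 70e9
for name, bits in [("FP16/BF16", 16), ("FP8/INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{name:>9}: {gib:6.0f} GiB")
# FP16/BF16 ~130 GiB, FP8/INT8 ~65 GiB, INT4 ~33 GiB (weights only)
```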
NVIDIA H100 GPUs with TensorRT-LLM give users the ability to easily convert their model weights into the new FP8 format and compile their models to automatically take advantage of optimised FP8 kernels. This is made possible through Hopper Transformer Engine technology and is done without having to change any model code.
The FP8 data format introduced with the H100 enables developers to quantise their models and radically reduce memory consumption without degrading model accuracy. FP8 quantisation retains higher accuracy than other data formats like INT8 or INT4, while achieving the fastest performance and offering the simplest implementation.
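A sketch of how FP8 might be enabled through the Python API is shown below. The QuantConfig/QuantAlgo names, the import path, and the arguments are assumptions about the library's quantisation interface and may not match a given release, but the key point from the text holds: no model code changes are required.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo  # assumed import path

# Assumed configuration: quantise weights, activations, and the KV cache to
# FP8 when building the engine; the model identifier is illustrative.
quant = QuantConfig(quant_algo=QuantAlgo.FP8,
                    kv_cache_quant_algo=QuantAlgo.FP8)
llm = LLM(model="meta-llama/Llama-2-7b-hf", quant_config=quant)
```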
Get Started with TensorRT-LLM
NVIDIA TensorRT-LLM is now available in early access and soon will be integrated into the NVIDIA NeMo framework—part of NVIDIA AI Enterprise, an enterprise-grade AI software platform with security, stability, manageability, and support. Developers and researchers will be able to access TensorRT-LLM through the NeMo framework on NGC or through the source repository on GitHub.
Note that you must be registered in the NVIDIA Developer Program to apply for the early access release. You must also be logged in using your organisation’s email address. We cannot accept applications from accounts using Gmail, Yahoo, QQ, or other personal email accounts.