GPU

A Graphics Processing Unit (GPU) is a specialized processor initially designed to accelerate the rendering of images, animations, and videos for display.

Unlike a CPU, which is optimized for general-purpose tasks, a GPU excels at parallel processing, allowing it to handle many calculations simultaneously.

This parallelism makes GPUs highly efficient for tasks that involve large-scale data processing, such as model training and inference in machine learning and AI, where high computational throughput is essential.

Why are GPUs used for LLM training and inference?

Parallel Processing Power

GPUs are designed to handle many calculations simultaneously (parallel computing), making them ideal for the massive number of operations required in LLMs. Large neural networks contain millions or billions of parameters that need to be updated during training, and GPUs can process many of these in parallel, significantly speeding up the computation compared to CPUs, which are optimized for sequential processing.
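As a rough illustration, the sketch below (PyTorch, with an invented parameter count and learning rate) shows that a gradient-descent update over millions of parameters is expressed as a single vectorized operation, which is exactly the kind of work a GPU can spread across thousands of cores.

```python
# Minimal sketch: a parameter update written as one vectorized operation.
# The parameter count and learning rate are illustrative, not from the text.
import torch

num_params = 10_000_000      # stand-in for a model's parameter count
lr = 1e-3                    # illustrative learning rate

params = torch.randn(num_params)
grads = torch.randn(num_params)

# One elementwise update over every parameter. Moving the tensors to a GPU
# (e.g. params.cuda()) lets the same line run across thousands of cores at once.
params -= lr * grads
```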

Matrix and Tensor Operations

LLMs rely heavily on matrix and tensor operations (e.g., multiplying matrices of input data with weights, computing gradients, etc.). GPUs are specifically designed to perform these types of operations efficiently, making them much faster than CPUs for tasks involving large-scale linear algebra, which is critical for LLMs.
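As a sketch of what this looks like in practice (assuming PyTorch and, optionally, a CUDA device; the matrix size is arbitrary), the same matrix multiplication can be timed on the CPU and on the GPU:

```python
# Minimal sketch: time one large matrix multiplication on the CPU and, if
# available, on a CUDA GPU. Sizes and timings are illustrative only.
import time
import torch

n = 4096
a = torch.randn(n, n)
b = torch.randn(n, n)

t0 = time.perf_counter()
c_cpu = a @ b                              # matmul on the CPU
cpu_s = time.perf_counter() - t0

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    c_gpu = a_gpu @ b_gpu                  # same matmul on the GPU
    torch.cuda.synchronize()               # wait for the asynchronous kernel
    print(f"CPU: {cpu_s:.3f}s, GPU: {time.perf_counter() - t0:.3f}s")
else:
    print(f"CPU: {cpu_s:.3f}s (no CUDA device available)")
```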

Memory Bandwidth

GPUs have higher memory bandwidth compared to CPUs, which allows them to quickly transfer and process large datasets and model parameters. LLMs require vast amounts of memory for storing weights and activations (output from neurons), and GPUs provide the memory resources needed to handle such large-scale models.
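A back-of-the-envelope sketch (plain Python; the 7B and 70B parameter counts are just familiar example sizes, not figures from the text) shows how quickly the weights alone consume memory:

```python
# Rough sketch: memory needed just to hold model weights at different
# precisions. Activations, optimizer states, and KV caches add more on top.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate gigabytes needed to store the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for num_params in (7e9, 70e9):             # e.g. 7B- and 70B-parameter models
    for dtype in ("fp32", "fp16", "int8"):
        print(f"{num_params / 1e9:.0f}B params @ {dtype}: "
              f"{weight_memory_gb(num_params, dtype):.0f} GB")
```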

Training Time Reduction

Training LLMs requires extensive amounts of data and computational resources, often taking weeks or even months to complete. GPUs dramatically reduce training times due to their ability to perform parallel operations on huge datasets efficiently. In distributed GPU setups, training can be scaled across multiple GPUs, further speeding up the process.
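As one minimal illustration of scaling across GPUs (a sketch only, using PyTorch's simple nn.DataParallel wrapper on a toy linear layer; real LLM training typically uses DistributedDataParallel or more specialized frameworks):

```python
# Minimal sketch: spread a forward pass across all GPUs in one machine.
# The model and batch are toy placeholders, not a real LLM.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)                 # stand-in for a real network

if torch.cuda.device_count() > 1:
    # DataParallel splits each batch across the visible GPUs and gathers
    # the outputs back on the default device.
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(64, 1024, device=device)      # toy batch of 64 samples
y = model(x)                                  # sharded across GPUs if present
print(y.shape)
```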

Inference Efficiency

During inference (i.e., when the model is used to generate outputs), LLMs still perform millions or billions of operations to process a single input, especially for tasks like generating long text sequences. GPUs enable real-time or near-real-time inference by handling these computations quickly, ensuring fast response times.
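To put a rough number on this, a commonly used rule of thumb estimates a transformer forward pass at about 2 × (number of parameters) floating-point operations per generated token; the sketch below (plain Python, with an illustrative 70B-parameter model and a 500-token response) applies that estimate.

```python
# Rough sketch using the common "~2 FLOPs per parameter per token" estimate
# for a transformer forward pass. Model size and token count are illustrative.
def inference_flops(num_params: float, num_tokens: int) -> float:
    """Approximate total FLOPs to generate num_tokens tokens."""
    return 2 * num_params * num_tokens

num_params = 70e9     # e.g. a 70B-parameter model
num_tokens = 500      # a medium-length generated response
print(f"~{inference_flops(num_params, num_tokens):.1e} FLOPs "
      f"to generate {num_tokens} tokens")   # ~7.0e13 FLOPs
```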

How is the power of a GPU measured?

FLOPS = Floating-point Operations Per Second

The power of a CPU or GPU is often measured in terms of floating-point operations per second (FLOPS). This metric indicates how quickly a processor can perform mathematical operations involving floating-point numbers, which are essential for various computational tasks, such as scientific simulations, machine learning, and video processing.

Here’s a breakdown of how FLOPS are calculated:

1. Identify the relevant operations:

  • CPU: Peak figures are usually quoted for double-precision (64-bit) operations, the standard in scientific benchmarks such as LINPACK, though single-precision (32-bit) rates are also reported for general-purpose workloads.
  • GPU: Consumer GPUs are optimized for single-precision (32-bit) and lower precisions such as FP16, which dominate graphics and machine-learning workloads; data-center GPUs additionally offer strong double-precision (64-bit) throughput for scientific simulations and other high-precision calculations.

2. Measure the operation speed:

  • Benchmarking software: Specialized tools like LINPACK or SPEC CPU2017 are used to run standardized workloads on the processor.
  • Performance counters: These hardware features provide insights into the number of operations executed per second.

3. Calculate FLOPS:

  • Multiply operations per cycle by clock speed and core count: The number of floating-point operations each core can issue per clock cycle, times the clock speed, times the number of cores, gives the theoretical peak FLOPS (see the worked example below).
  • Consider efficiency factors: In practice, the FLOPS actually achieved are lower due to factors such as memory bandwidth, instruction pipelining, and cache behavior.
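Here is a small worked example of that calculation (plain Python; the core count, clock speed, FLOPs per cycle, and board power are hypothetical numbers, not a real processor's specification):

```python
# Theoretical peak: cores * clock (Hz) * floating-point ops per core per cycle.
def peak_flops(cores: int, clock_hz: float, flops_per_cycle: int) -> float:
    return cores * clock_hz * flops_per_cycle

# Hypothetical GPU: 10,000 cores at 1.5 GHz, 2 FLOPs per cycle (one fused
# multiply-add counts as two floating-point operations).
peak = peak_flops(cores=10_000, clock_hz=1.5e9, flops_per_cycle=2)
print(f"Theoretical peak: {peak / 1e12:.1f} TFLOPS")       # 30.0 TFLOPS

# FLOPS per watt, assuming a hypothetical 300 W board power.
print(f"Efficiency: {peak / 300 / 1e9:.0f} GFLOPS/W")       # 100 GFLOPS/W
```

The sustained FLOPS measured by a real benchmark will land below this theoretical peak for the reasons listed next.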

Key points to remember:

  • Peak FLOPS: This is the maximum theoretical performance based on the processor’s specifications.
  • Sustained FLOPS: This reflects the actual performance under real-world workloads, which can be influenced by various factors.
  • FLOPS/watt: This metric measures the energy efficiency of a processor, considering the number of FLOPS achieved per unit of power consumed.

Additional considerations:

  • Vector processing: GPUs often employ vector units that can perform multiple operations simultaneously in a single clock cycle, leading to higher FLOPS.
  • Tensor cores: Some GPUs have specialized hardware (tensor cores) that can accelerate matrix-matrix operations, which are common in deep learning applications.
  • Mixed-precision: Modern processors may support mixed-precision calculations, where certain operations are performed in lower precision to improve performance while maintaining sufficient accuracy.
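As an illustration of the mixed-precision point above, here is a minimal sketch of a single training step using PyTorch automatic mixed precision (AMP); the model, data, and optimizer are toy placeholders.

```python
# Minimal sketch of mixed-precision training with PyTorch AMP.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(32, 512, device=device)
target = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, dtype=torch.float16,
                    enabled=(device == "cuda")):
    # Matmul-heavy ops run in FP16 here; precision-sensitive ops stay in FP32.
    loss = nn.functional.cross_entropy(model(x), target)

scaler.scale(loss).backward()   # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()
```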

By understanding the concept of FLOPS and the factors that influence it, you can better evaluate the computational power of CPUs and GPUs for various applications.

FLOPS Comparison

For i7, A100, M2 Ultra, RTX 4090, RTX 4080 Ti, RTX 4080, and RTX 4070 Ti

| Processor | Company | Type | Approx. Peak FP32 (TFLOPS) | Approx. Peak FP64 (TFLOPS) | Approx. Peak FP16 (TFLOPS) | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| GeForce RTX 4090 | NVIDIA | GPU | 45 | 1.1 | 90 | Flagship consumer GPU, offering exceptional performance for gaming and content creation |
| NVIDIA A100 | NVIDIA | GPU | 20 | 5 | 40 | High-performance GPU, often used for AI and HPC |
| GeForce RTX 4080 Ti | NVIDIA | GPU | 30 | 0.75 | 60 | High-end consumer GPU, providing excellent performance for demanding workloads |
| GeForce RTX 4080 | NVIDIA | GPU | 25 | 0.63 | 50 | Mid-range high-performance GPU, suitable for a wide range of gaming and creative tasks |
| M2 Ultra | Apple | CPU & GPU | 8 | 0.2 | 16 | Apple’s most powerful chip, designed for high-performance computing |
| GeForce RTX 4070 Ti | NVIDIA | GPU | 15 | 0.38 | 30 | Mainstream high-performance GPU, offering good performance for gaming and general-purpose computing |
| Intel Core i7 | Intel | CPU | 1-3 | 0.25-0.75 | 2-6 | Varies depending on specific model and generation |

Key points:

  • FP32 FLOPS are generally higher than FP64 FLOPS, especially for GPUs.

  • The performance gap between FP32 and FP64 can vary significantly depending on the processor and its architecture.

  • FP16 FLOPS are typically double the FP32 FLOPS, but with a potential loss of precision.

  • For applications that require high precision, FP64 is essential. For many applications, FP32 offers a good balance of speed and accuracy.

  • FP16 can be used to accelerate training, especially on hardware that supports it. However, it may lead to a slight loss of precision.

  • Intel Core i7 is a versatile processor suitable for a wide range of tasks, including gaming and content creation. Its FLOPS can vary significantly depending on the specific model.

  • NVIDIA A100 is a powerful GPU designed for demanding workloads like AI and HPC. It offers significantly higher FLOPS compared to the i7.

  • M2 Ultra is Apple’s most powerful chip, designed for high-performance computing tasks. It offers a significant boost in graphics performance compared to previous generations.

History

This table gives a condensed overview of significant moments in GPU development.

Source: Wikipedia

| Year | Development | Significance |
| --- | --- | --- |
| 1970s | Emergence of specialized graphics circuits in arcade games. | Laid the groundwork for dedicated graphics hardware. |
| 1979 | Introduction of the Namco Galaxian arcade system with advanced graphics features. | Popularized the use of specialized graphics hardware in arcade games. |
| 1979 | Atari 8-bit computers feature ANTIC, a video processor capable of interpreting display lists and enabling smooth scrolling. | Advanced graphics capabilities for personal computers. |
| 1982 | Release of Williams Electronics arcade games with custom blitter chips for 16-color bitmaps. | Showcased the potential of dedicated hardware for bitmap manipulation. |
| 1984 | Hitachi releases the ARTC HD63484, the first major CMOS graphics processor for PCs. | Enabled high-resolution displays (up to 4K monochrome) for personal computers. |
| 1986 | Texas Instruments introduces the TMS34010, the first fully programmable graphics processor. | Marked a shift towards programmable graphics hardware, allowing for greater flexibility. |
| 1987 | IBM releases the IBM 8514 graphics system, one of the first video cards to implement 2D primitives in hardware. | Advanced 2D graphics capabilities for IBM PC compatibles. |
| 1988 | First dedicated polygonal 3D graphics boards appear in arcades (Namco System 21 and Taito Air System). | Marked the beginning of real-time 3D graphics in a commercial setting. |
| 1990s | Rapid evolution of 2D GUI acceleration and the rise of hardware-accelerated 3D graphics. | Led to the integration of video, 2D, and 3D capabilities on a single chip. |
| 1994 | Sony coins the term “GPU” for the PlayStation’s graphics processor. | |
| 1999 | Nvidia popularizes the term “GPU” with the release of the GeForce 256, marketed as the world’s first GPU. | Solidified the term “GPU” and highlighted the increasing power and programmability of graphics processors. |
| Early 2000s | GPUs begin to feature programmable shading, allowing for more complex visual effects. | Marked a significant step towards GPUs becoming general-purpose computing devices. |
| 2006 | Widespread use of general-purpose computing on GPUs (GPGPU). | GPUs were no longer limited to graphics processing, opening up new possibilities in various fields. |
| 2007 | Introduction of Nvidia’s CUDA platform, the first widely adopted programming model for GPU computing. | Facilitated the development of general-purpose applications for GPUs. |
| 2010s | GPUs continue to evolve with increased performance, efficiency, and features like hardware-accelerated ray tracing. | Led to significant advancements in gaming, professional graphics, and artificial intelligence. |
| 2020s | GPUs become increasingly used for AI, particularly in training large language models. | |

Approximate GPU Costs (Dec 2024)

| Model | Memory | Price Range | Typical Use Cases |
| --- | --- | --- | --- |
| NVIDIA A100 | 40GB | ~$12,000–$15,000 | High-performance servers, cloud environments (AWS, Azure, Google Cloud), fine-tuning in clusters |
| NVIDIA A100 | 80GB | ~$15,000–$20,000 | High-performance servers, cloud environments (AWS, Azure, Google Cloud), fine-tuning in clusters |
| NVIDIA H100 | 80GB | ~$25,000–$35,000 | HPC systems, hyperscale data centers, pre-training & fine-tuning |
| NVIDIA V100 | 32GB | ~$6,000–$10,000 | Research, enterprise applications |
| NVIDIA RTX 6000 Ada | 48GB | ~$6,500–$8,000 | Small-scale data centers |