The Silicon Sovereign: The Origins, Rise, and Evolution of the Google Tensor Processing Unit (written by Gemini 3.0)

Executive Summary

The history of digital computation has been dominated by the paradigm of general-purpose processing. For nearly half a century, the Central Processing Unit (CPU) served as the universal engine of the information age. This universality was its greatest strength and, eventually, its critical weakness. As the mid-2010s approached, the tech industry faced a collision of two tectonic trends: the deceleration of Moore’s Law and the explosive growth of Deep Learning (DL).

This report provides an exhaustive analysis of Google’s response to this collision: the Tensor Processing Unit (TPU). It traces the lineage of the TPU from clandestine experiments on “coffee table” servers to the deployment of the Exascale-class TPU v7 “Ironwood.” It explores the architectural philosophy of Domain Specific Architectures (DSAs) championed by Turing Award winner David Patterson, arguing that the future of computing lies not in doing everything reasonably well, but in doing one thing—matrix multiplication—with absolute efficiency.

Through a detailed examination of seven generations of hardware, this report illustrates how the TPU enabled the modern AI revolution, powering everything from AlphaGo to the Gemini and PaLM models. It details technical specifications, the shift to liquid cooling, the introduction of optical interconnects, and the “Age of Inference” ushered in by Ironwood. The analysis suggests that the TPU is a vertically integrated supercomputing instrument that allowed Google to decouple its AI ambitions from the constraints of the merchant silicon market.

Part I: The Computational Crisis and the Birth of the TPU (2006–2015)

1.1 The Pre-History and the “Coffee Table” Era

To understand the genesis of the TPU, one must understand the physical constraints facing machine learning pioneers in the early 2010s. Before the era of polished cloud infrastructure, the hardware reality for deep learning researchers was chaotic and improvised.

In 2012, Zak Stone—who would later found the Cloud TPU program—operated in a startup environment that necessitated extreme frugality. To acquire the necessary compute power for training early neural networks, Stone and his co-founders resorted to purchasing used gaming GPUs from online marketplaces. They assembled these disparate components into servers resting on their living room coffee table. The setup was so power-inefficient that turning on a household microwave would frequently trip the circuit breakers, plunging their makeshift data center into darkness.1

This “coffee table” era serves as a potent metaphor for the state of the industry. The hardware being used—GPUs designed for rendering video game textures—was accidentally good at the parallel math required for AI, but it was not optimized for it.

1.2 The “Back-of-the-Napkin” Catastrophe

By 2013, deep learning was moving from academic curiosity to product necessity. Jeff Dean, Google’s Chief Scientist, performed a calculation that would become legendary in the annals of computer architecture. He estimated the computational load if Google’s user base of 100 million Android users utilized the voice-to-text feature for a mere three minutes per day. The results were stark: supporting this single feature would require doubling the number of data centers Google owned globally.1

This was an economic and logistical impossibility. Building a data center is a multi-year, multi-billion dollar capital expenditure. The projection revealed an existential threat: if AI was the future of Google services, the existing hardware trajectory (CPUs) and the alternative (GPUs) were insufficient.

1.3 The Stealth Project and the 15-Month Sprint

Faced with this “compute cliff,” Google initiated a covert hardware project in 2013 to build a custom Application-Specific Integrated Circuit (ASIC) that could accelerate machine learning inference by an order of magnitude.4

Led by Norm Jouppi, a distinguished hardware architect known for his work on the MIPS processor, the team operated on a frantic 15-month timeline.4 They prioritized speed over perfection, shipping the first silicon to data centers without fixing known bugs, relying instead on software patches. The chip was packaged as an accelerator card that fit into the SATA hard drive slots of Google’s standard servers, allowing for rapid deployment without redesigning server racks.4

For nearly two years, these chips—the TPU v1—ran in secret, powering Google Search, Translate, and the AlphaGo system that defeated Lee Sedol in 2016, all while the outside world remained oblivious.3

Part II: The Architecture of Efficiency — TPU v1

The TPU v1 was a domain-specific accelerator designed strictly for inference.

2.1 The Systolic Array: The Heart of the Machine

The defining feature of the TPU is the Matrix Multiply Unit (MXU), built around a Systolic Array. Unlike a CPU, which shuttles every operand through registers in a fetch-decode-execute-writeback cycle, a systolic array pumps data through a grid of multiply-accumulate (MAC) units in a rhythmic pulse: each operand is read from memory once and then reused as it flows from cell to cell.

In the TPU v1, this array consisted of a 256 × 256 grid of MACs, 65,536 in total.6
This arrangement yields two key advantages (a minimal dataflow sketch follows the list):

  1. Data Reuse: A single input is used for 256 operations before being discarded.
  2. Density: Control logic occupied only 2% of the die area, allowing more space for ALUs.6
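
The dataflow is easier to grasp in simulation. The sketch below is a toy, cycle-level model of a weight-stationary systolic array written in Python/NumPy. It illustrates the data reuse and the absence of per-operation instruction fetch, but it is an illustration of the concept only, not Google's actual design; the function names, array sizes, and skewing scheme are assumptions of the sketch.

```python
import numpy as np

def systolic_matmul(X, W):
    """Cycle-level sketch of a weight-stationary systolic array computing X @ W.

    W (K x N) is held stationary, one weight per MAC cell. Activations from
    X (M x K) enter row k of the array skewed by k cycles and flow left to
    right, so each activation is reused by all N cells in its row; partial
    sums flow downward through the K cells of each column.
    """
    M, K = X.shape
    K2, N = W.shape
    assert K == K2

    def input_act(t, k):
        # Skewed injection: array row k sees x[m, k] at cycle t = m + k.
        m = t - k
        return X[m, k] if 0 <= m < M else 0.0

    Y = np.zeros((M, N))
    a_reg = [[0.0] * N for _ in range(K)]   # activation held in cell (k, n)
    p_reg = [[0.0] * N for _ in range(K)]   # partial sum leaving cell (k, n)
    for t in range(M + K + N):              # enough cycles to drain the pipeline
        # Update cells bottom-right first so each cell reads its neighbours'
        # values from the previous cycle.
        for k in reversed(range(K)):
            for n in reversed(range(N)):
                a_in = a_reg[k][n - 1] if n > 0 else input_act(t, k)
                p_in = p_reg[k - 1][n] if k > 0 else 0.0
                p_reg[k][n] = p_in + a_in * W[k, n]
                a_reg[k][n] = a_in
        # Collect finished results from the bottom row of the array.
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                Y[m, n] = p_reg[K - 1][n]
    return Y

X = np.array([[1.0, 2.0], [3.0, 4.0]])
W = np.array([[5.0, 6.0], [7.0, 8.0]])
print(systolic_matmul(X, W))   # matches X @ W: [[19. 22.] [43. 50.]]
```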

2.2 Quantization: The 8-Bit Gamble

The TPU v1 aggressively used Quantization, operating on 8-bit integers (INT8) rather than the standard 32-bit floating-point numbers.6 This decision quadrupled memory bandwidth and significantly reduced energy consumption, as an 8-bit integer addition consumes roughly 13x less energy than a 16-bit floating-point addition.7
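
To make the arithmetic concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization with a wide integer accumulator. It illustrates the general pattern of narrow multiplies feeding wide accumulation; it is not Google's production quantization pipeline, and the scaling scheme and helper names are assumptions of the sketch.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: map floats onto the INT8 range
    using a single scale factor."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    """Multiply in INT8, accumulate in INT32, then rescale back to float."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)   # wide accumulator
    return acc.astype(np.float32) * (sa * sb)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 3)).astype(np.float32)
# The quantized result tracks the FP32 reference to within a small error.
print(np.max(np.abs(int8_matmul(a, b) - a @ b)))
```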

2.3 Technical Specifications and Competitive Landscape

When the specifications were published in 2017, they revealed a processor radically more specialized than its contemporaries.

| Feature | TPU v1 | NVIDIA K80 GPU | Intel Haswell CPU |
|---|---|---|---|
| Primary Workload | Inference (INT8) | Training/Graphics (FP32) | General Purpose |
| Peak Performance | 92 TOPS (8-bit) | 2.8 TOPS (8-bit) | 1.3 TOPS (8-bit) |
| Power Consumption | ~40 W (busy) | ~300 W | ~145 W |
| Clock Speed | 700 MHz | ~560–875 MHz | 2.3 GHz+ |
| On-Chip Memory | 28 MiB (Unified Buffer) | Shared cache | L1/L2/L3 caches |

Data compiled from [4].

The TPU v1 achieved 92 TeraOps/second (TOPS) while consuming only 40 Watts, providing a 15x to 30x speedup in inference and a 30x to 80x improvement in energy efficiency (performance/Watt) compared to contemporary CPUs and GPUs.6

Part III: The Patterson Doctrine and Domain Specific Architectures

The technical success of the TPU v1 validated the theories of David Patterson, a Turing Award winner who joined Google as a Distinguished Engineer in 2016.8

3.1 The End of General Purpose Scaling

Patterson argued that Moore’s Law (transistor density) and Dennard Scaling (power density) were failing. Consequently, the only path to significant performance gains—10x or 100x—was through Domain Specific Architectures (DSAs).10

3.2 The DSA Philosophy

The TPU is the archetypal DSA. By removing "general purpose" features like branch prediction and out-of-order execution, the TPU devotes nearly all its silicon to arithmetic. Patterson noted that the TPU v1 used a CISC-style (Complex Instruction Set Computer) interface: the host sends a small number of high-level commands over the PCIe bus rather than a stream of fine-grained instructions, so the relatively slow link to the host CPU does not become a bottleneck.6

Part IV: The Pivot to Training (TPU v2 and v3)

To free itself from NVIDIA GPUs, Google needed a chip capable of training, which requires higher precision (floating point) and backpropagation.

4.1 TPU v2: The Introduction of Cloud TPU (2017)

Introduced in 2017, the TPU v2 was a supercomputing node featuring:

  • High Bandwidth Memory (HBM): Replacing DDR3 to provide 600 GB/s throughput.13
  • Inter-Chip Interconnect (ICI): Dedicated links allowing 256 chips (512 cores) to form a 2D Torus network, creating a coherent supercomputer called a TPU Pod.13
  • Performance: A 4-chip module delivered ~180 TFLOPS.14

4.2 The Invention of bfloat16

Google researchers invented the bfloat16 (Brain Floating Point) format for TPU v2. By truncating the mantissa of a 32-bit float but keeping the 8-bit exponent, they achieved the numerical stability of FP32 with the speed and memory density of FP16.14 This format has since become an industry standard.
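
A small emulation makes the trade-off concrete. The sketch below is my own approximation (using truncation rather than the rounding real hardware typically applies): bfloat16 is simply the top 16 bits of an IEEE FP32 value, preserving the 8-bit exponent and its dynamic range while giving up mantissa precision.

```python
import numpy as np

def to_bfloat16(x):
    """Emulate bfloat16 by truncating FP32 values to their top 16 bits
    (1 sign + 8 exponent + 7 mantissa bits). Real hardware typically rounds
    to nearest; plain truncation keeps this sketch simple."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# bfloat16 keeps FP32's dynamic range (FP16 overflows above ~65,504)...
print(to_bfloat16([3.0e38]))      # ~2.99e+38, still finite
# ...but carries only about 3 decimal digits of precision (7-bit mantissa).
print(to_bfloat16([1.2345678]))   # [1.234375]
```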

4.3 TPU v3: Breaking the Thermal Wall (2018)

The TPU v3 pushed peak compute to 123 TFLOPS per chip.15 However, the power density was too high for air cooling. Google introduced liquid cooling directly to the chip, allowing v3 Pods to scale to 1,024 chips and deliver over 100 PetaFLOPS.16

Part V: The Optical Leap — TPU v4 (2020)

For the Large Language Model (LLM) era, Google needed exascale capabilities.

5.1 Optical Circuit Switching (OCS)

TPU v4 introduced Optical Circuit Switches (OCS). Instead of electrical switching, OCS uses MEMS mirrors to reflect light beams, reconfiguring the network topology on the fly (e.g., from 3D Mesh to Twisted Torus).18 This allowed v4 Pods to scale to 4,096 chips and 1.1 exaflops of peak compute.18
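
To make the topology concrete, the toy function below enumerates a chip's neighbours in a 3D torus with wraparound links on each axis; in such a torus every chip has six neighbours. This is an illustration of the topology only, not Google's networking software, and the 16 × 16 × 16 shape is simply one arrangement that accounts for a 4,096-chip pod.

```python
def torus_neighbors(coord, dims):
    """Return the six neighbours of a chip at `coord` in a 3D torus of shape
    `dims`, with wraparound links on every axis."""
    x, y, z = coord
    X, Y, Z = dims
    return [((x + 1) % X, y, z), ((x - 1) % X, y, z),
            (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
            (x, y, (z + 1) % Z), (x, y, (z - 1) % Z)]

# A 16 x 16 x 16 arrangement holds 4,096 chips; the corner chip's -1
# coordinates wrap around to 15.
print(torus_neighbors((0, 0, 0), (16, 16, 16)))
```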

5.2 The SparseCore

To accelerate recommendation models (DLRMs), which rely on massive embedding tables, TPU v4 introduced the SparseCore. These dataflow processors accelerated embeddings by 5x-7x while occupying only 5% of the die area.19
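
The workload the SparseCore targets looks nothing like dense matrix multiplication, and a toy example shows why. The sketch below uses illustrative sizes, not production DLRM figures: an embedding lookup is a sparse gather from a large table followed by a small reduction, so it is dominated by irregular memory access rather than MXU arithmetic.

```python
import numpy as np

# Toy embedding table: 100k rows of 128 floats (~50 MB). Real DLRM tables are
# orders of magnitude larger, which is what makes the lookups memory-bound.
vocab, dim = 100_000, 128
table = np.random.default_rng(0).standard_normal((vocab, dim)).astype(np.float32)

def embed(feature_ids):
    """Gather the rows for a batch of categorical feature IDs and sum-pool them.
    Almost no arithmetic per byte moved, so a dense MXU is a poor fit."""
    return table[feature_ids].sum(axis=0)

print(embed(np.array([3, 17, 99_999])).shape)   # (128,)
```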

5.3 Exascale AI and PaLM

The v4 Pods were used to train PaLM (Pathways Language Model) across two Pods simultaneously, achieving a hardware utilization efficiency of 57.8%.20

Part VI: Divergence and Specialization — TPU v5 (2023)

In 2023, Google bifurcated the TPU line to address different market economics.

  • TPU v5e (Efficiency): Optimized for cost-effective inference and smaller training jobs, delivering 197 TFLOPS (bf16) per chip.21
  • TPU v5p (Performance): Designed for brute-force scale, offering 459 TFLOPS (bf16) per chip and scaling to 8,960 chips in a single pod.21

Part VII: Trillium — TPU v6 (2024)

Announced in May 2024, Trillium (TPU v6e) focused on the “Memory Wall.”

  • Compute: 918 TFLOPS (bf16) per chip (4.7x increase over v5e).21
  • Memory: 32 GB HBM with 1,600 GB/s bandwidth.21
  • Efficiency: 67% more energy-efficient than v5e.

Part VIII: Ironwood — TPU v7 and the Age of Inference (2025)

In April 2025, Google unveiled its most ambitious silicon to date: TPU v7, codenamed Ironwood. While previous generations chased training performance, Ironwood was explicitly architected for the “Age of Inference” and agentic workflows.

8.1 The Specs of a Monster

Ironwood represents a major leap in raw throughput and memory density, designed to hold large "Chain of Thought" reasoning models entirely in memory (a rough roofline check follows the list below).

  • Peak Compute: 4,614 TFLOPS (FP8) per chip.14 (Note: approx. 2.3 PFLOPS in BF16).
  • Memory: 192 GB of HBM per chip.
  • Bandwidth: A staggering 7.4 TB/s of memory bandwidth.
  • Interconnect: The ICI bandwidth was boosted to support 9.6 Tbps (aggregate bidirectional) to enable massive model parallelism.
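
A quick ratio using only the per-chip figures quoted above shows how compute and bandwidth are balanced; this is a back-of-the-envelope calculation of my own, not an official Google analysis.

```python
# Roofline-style balance point per Ironwood chip, using the figures listed above.
peak_ops_per_s = 4_614e12    # 4,614 TFLOPS of FP8 compute
hbm_bytes_per_s = 7.4e12     # 7.4 TB/s of HBM bandwidth

balance = peak_ops_per_s / hbm_bytes_per_s
print(f"~{balance:.0f} ops per HBM byte needed to stay compute-bound")  # ~624

# Serving workloads that stream large model weights from HBM every step perform
# far fewer operations per byte than this, so they are bandwidth-bound; that is
# why an inference-era chip pushes HBM capacity and bandwidth as hard as FLOPS.
```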

8.2 Scale and Efficiency

A single Ironwood pod can scale to 9,216 chips.14 Google claims Ironwood delivers 2x the performance per watt compared to the already efficient Trillium (v6e). This efficiency is critical as data centers face power constraints; Ironwood allows Google to deploy agentic models that “think” for seconds or minutes (inference-heavy workloads) without blowing out power budgets.

Part IX: Environmental Impact and Sustainability

David Patterson and Jeff Dean championed the metric of Compute Carbon Intensity (CCI).22 Their research highlights that the vertical integration of the TPU (including liquid cooling and OCS) reduces the carbon footprint of AI. TPU v4, for instance, emitted roughly 20x less CO2e than contemporary DSAs running in typical on-premise data centers.20

Part X: Comparative Analysis — TPU vs. GPU

| Feature | TPU v1 (2015) | TPU v4 (2020) | TPU v5p (2023) | TPU v6e (2024) | TPU v7 (2025) |
|---|---|---|---|---|---|
| Codename | — | Pufferfish | — | Trillium | Ironwood |
| Use Case | Inference | Training | LLM Training | Efficient Training | Agentic Inference |
| Peak Compute | 92 TOPS (INT8) | 275 TFLOPS (BF16) | 459 TFLOPS (BF16) | 918 TFLOPS (BF16) | ~2,300 TFLOPS (BF16) |
| HBM per Chip | N/A (8 GiB DDR3) | 32 GB | 95 GB | 32 GB | 192 GB |
| Memory Bandwidth | 34 GB/s | 1,200 GB/s | 2,765 GB/s | 1,600 GB/s | 7,400 GB/s |
| Pod Size (chips) | N/A | 4,096 | 8,960 | 256 | 9,216 |

Table compiled from [6], [15], and [21].

Conclusion

The TPU is not merely a chip; it is a “Silicon Sovereign.” From the coffee table to the Ironwood pod, Google has successfully decoupled its AI destiny from the merchant market, building a machine that spans from the atom to the algorithm.

Works cited

  1. TPU transformation: A look back at 10 years of our AI-specialized …, accessed November 25, 2025, https://cloud.google.com/transform/ai-specialized-chips-tpu-history-gen-ai
  2. Lost + Found Part 2 – Tell Us Something, accessed November 25, 2025, https://www.tellussomething.org/stories/2025/lost-and-found/lost-found-part-2/
  3. Google’s Self-developed Chip Empire – metrans, accessed November 25, 2025, https://www.metrans.hk/Google%E2%80%99s-Self-developed-Chip-Empire/
  4. An in-depth look at Google’s first Tensor Processing Unit (TPU) | Google Cloud Blog, accessed November 25, 2025, https://cloud.google.com/blog/products/ai-machine-learning/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
  5. Our 10 biggest AI moments so far – Google Blog, accessed November 25, 2025, https://blog.google/technology/ai/google-ai-ml-timeline/
  6. In-Datacenter Performance Analysis of a Tensor Processing … – arXiv, accessed November 25, 2025, https://arxiv.org/abs/1704.04760
  7. Google Reveals Technical Specs and Business Rationale for TPU Processor | TOP500, accessed November 25, 2025, https://www.top500.org/news/google-reveals-technical-specs-and-business-rationale-for-tpu-processor/
  8. Evaluation of the Tensor Processing Unit: A Deep Neural Network Accelerator for the Datacenter – EECS at Berkeley, accessed November 25, 2025, https://eecs.berkeley.edu/research/colloquium/170315-2/
  9. ‘Retired’ UC Berkeley Professor Tackles Google Tensor Processing Unit – HPCwire, accessed November 25, 2025, https://www.hpcwire.com/2017/05/09/retired-uc-berkeley-professor-tackles-google-tensor-processing-unit/
  10. AI targets, rather than Moore’s Law, to drive chip industry – HEXUS.net, accessed November 25, 2025, https://hexus.net/tech/news/industry/120146-ai-targets-rather-moores-law-drive-chip-industry/
  11. Lessons of last 50 years of Computer Architecture – 1. Raising the hardware/software interface creates – Edge AI and Vision Alliance, accessed November 25, 2025, https://www.edge-ai-vision.com/wp-content/uploads/2020/12/Patterson_UCBerkeley_2020_Embedded_Vision_Summit_Slides_Final.pdf
  12. A New Golden Age for Computer Architecture – Communications of the ACM, accessed November 25, 2025, https://cacm.acm.org/research/a-new-golden-age-for-computer-architecture/
  13. TPU architecture | Google Cloud Documentation, accessed November 25, 2025, https://cloud.google.com/tpu/docs/system-architecture-tpu-vm
  14. Tensor Processing Unit – Wikipedia, accessed November 25, 2025, https://en.wikipedia.org/wiki/Tensor_Processing_Unit
  15. TPU v4 – Google Cloud Documentation, accessed November 25, 2025, https://docs.cloud.google.com/tpu/docs/v4
  16. What Is a Tensor Processing Unit (TPU)? – Built In, accessed November 25, 2025, https://builtin.com/articles/tensor-processing-unit-tpu
  17. Tearing Apart Google’s TPU 3.0 AI Coprocessor – The Next Platform, accessed November 25, 2025, https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/
  18. [2304.01433] TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings – arXiv, accessed November 25, 2025, https://arxiv.org/abs/2304.01433
  19. TPU architecture – Google Cloud Documentation, accessed November 25, 2025, https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm
  20. TPU v4 enables performance, energy and CO2e efficiency gains | Google Cloud Blog, accessed November 25, 2025, https://cloud.google.com/blog/topics/systems/tpu-v4-enables-performance-energy-and-co2e-efficiency-gains
  21. TPU v6e – Google Cloud Documentation, accessed November 25, 2025, https://docs.cloud.google.com/tpu/docs/v6e
  22. David Patterson’s research works | University of California, Berkeley and other places, accessed November 25, 2025, https://www.researchgate.net/scientific-contributions/David-Patterson-2054732586