The inverse square root of a floating-point number $\frac{1}{\sqrt x}$ is used in calculating normalized vectors, which are in turn extensively used in various simulation scenarios such as computer graphics (e.g., to determine angles of incidence and reflection to simulate lighting).…
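A minimal sketch of the use case mentioned above, assuming a plain `vec3` type; the `1 / std::sqrt(x)` expression here is the exact-but-slow baseline that faster approximations compete against:

```cpp
#include <cmath>

struct vec3 { float x, y, z; };

// Scale a vector by the inverse square root of its squared length
// so that the result has unit length
vec3 normalize(vec3 v) {
    float r = 1.0f / std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x * r, v.y * r, v.z * r};
}
```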
If you are reading this chapter sequentially from the beginning, you might be wondering: why would I introduce integer arithmetic after floating-point arithmetic? Isn’t it supposed to be easier? True: plain integer representations are simpler. But, counterintuitively, their simplicity allows for more…
Compared to other arithmetic operations, division works very poorly on x86 and computers in general. Both floating-point and integer division are notoriously hard to implement in hardware. The circuitry takes a lot of space in the ALU, the computation has a lot of stages, and as a result, div and…
In 1940, the British mathematician G. H. Hardy published a famous essay titled “A Mathematician’s Apology” discussing the notion that mathematics should be pursued for its own sake rather than for the sake of its applications. Similar to mathematics, the various fields of computer science also form a…
Computers usually store time as the number of seconds that have passed since the 1st of January, 1970 — the start of the “Unix epoch” — and use these timestamps in all computations that have to do with time. We humans also keep track of time relative to some point in the past, which usually has a…
In modular arithmetic (and computational algebra in general), you often need to raise a number to the $n$-th power — to do modular division, perform primality tests, or compute some combinatorial values — and you usually want to spend fewer than $\Theta(n)$ operations calculating it. Binary…
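The technique this excerpt introduces is binary exponentiation; a self-contained sketch, assuming a fixed prime modulus `M` (the constant is just a common example):

```cpp
#include <cstdint>

typedef uint64_t u64;

const u64 M = 1'000'000'007; // an example prime modulus

// Compute a^n mod M in O(log n) multiplications by squaring the base
// and halving the exponent on each step
u64 binpow(u64 a, u64 n) {
    u64 r = 1;
    a %= M;
    while (n) {
        if (n & 1)
            r = r * a % M;
        a = a * a % M;
        n >>= 1;
    }
    return r;
}
```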
Fermat’s theorem allows us to calculate modular multiplicative inverses through binary exponentiation in $O(\log n)$ operations, but it only works with prime moduli. There is a generalization of it, Euler’s theorem, stating that if $m$ and $a$ are coprime, then $a^{\phi(m)} \equiv 1 \pmod m$…
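Spelled out, the step that turns Euler’s theorem into a division algorithm (a one-line derivation, not specific to this book):

```latex
a^{\phi(m)} \equiv 1 \pmod m
\;\implies\; a \cdot a^{\phi(m) - 1} \equiv 1 \pmod m
\;\implies\; a^{-1} \equiv a^{\phi(m) - 1} \pmod m
```

For a prime $m$, $\phi(m) = m - 1$, which recovers Fermat’s special case $a^{-1} \equiv a^{m-2} \pmod m$.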
Unsurprisingly, a large fraction of computation in modular arithmetic is often spent on calculating the modulo operation, which is as slow as general integer division and typically takes 15-20 cycles, depending on the operand size. The best way to deal with this nuisance is to avoid the modulo operation…
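One common instance of that avoidance strategy, sketched here under the assumption that both inputs are already reduced modulo `M`: modular addition never needs the `%` operator at all.

```cpp
#include <cstdint>

const uint32_t M = 1'000'000'007; // an example modulus

// For a, b < M the sum is below 2M, so one conditional subtraction
// (typically compiled to a branchless cmov) replaces the 15-20 cycle
// division hidden behind the % operator
uint32_t add_mod(uint32_t a, uint32_t b) {
    uint32_t s = a + b;
    return s >= M ? s - M : s;
}
```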
How long does it take to add two numbers together? Being one of the most frequently used instructions, add by itself only takes one cycle to execute. So, if the data is already loaded into registers, it takes just one cycle. But in the general case (*c = *a + *b), we need to fetch the operands from…
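A sketch of that general case; the comment shows roughly what a compiler emits (actual instruction selection varies):

```cpp
// Even a one-cycle add becomes a load-add-store chain once the
// operands live in memory rather than in registers:
//   mov eax, [rdi]   ; load *a
//   add eax, [rsi]   ; load *b and add
//   mov [rdx], eax   ; store *c
void add(int *a, int *b, int *c) {
    *c = *a + *b;
}
```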
Modern computer memory is highly hierarchical. It consists of multiple cache layers of varying speed and size, where higher levels typically store the most frequently accessed data from lower levels to reduce latency: each next level is usually an order of magnitude faster, but also smaller and/or more…
Early operating systems gave every process the freedom to read and modify any memory region it wants, including those allocated for other processes. While this keeps things simple, it also poses some problems: What if one of the processes is buggy or outright malicious? How do we prevent it…
To reason about the performance of memory-bound algorithms, we need to develop a cost model that is more sensitive to expensive block I/O operations but not so rigorous that it stops being useful. #Cache-Aware Model In the standard RAM model, we ignore the fact that primitive operations take unequal…
Now, let’s try to design some actually useful algorithms for the new external memory model. Our goal in this section is to slowly build up more complex things and eventually get to external sorting and its interesting applications. The algorithm will be based on the standard merge sort algorithm,…
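As a reference point for where the section is headed, here is a minimal sketch of the basic primitive, merging two sorted streams with O(1) state per stream; the file names are hypothetical, and a real implementation would read and write in large blocks:

```cpp
#include <fstream>

// Merge two sorted integer files into one; only one element per
// input stream is held in memory at any time
void merge_files(const char *fa, const char *fb, const char *fc) {
    std::ifstream a(fa), b(fb);
    std::ofstream c(fc);
    int x, y;
    bool ha = bool(a >> x), hb = bool(b >> y);
    while (ha || hb) {
        if (!hb || (ha && x <= y)) {
            c << x << '\n';
            ha = bool(a >> x);
        } else {
            c << y << '\n';
            hb = bool(b >> y);
        }
    }
}
```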
In this section, we will apply external sorting and joining to solve a problem that seems useless on the surface but is actually a key primitive used in a large number of external memory and parallel algorithms. Problem. Given a singly-linked list, compute the rank of each element, equal to its…
You can control the I/O operations of your program manually, but most of the time people just rely on automatic buffering and caching, either out of laziness or because of the limitations of the computing environment. But automatic caching comes with its own challenges. When a program runs out of…
In the context of the external memory model, there are two types of efficient algorithms: cache-aware algorithms that are efficient for known $B$ and $M$, and cache-oblivious algorithms that are efficient for any $B$ and $M$. For example, external merge sorting is a cache-aware, but not cache-oblivious…
To precisely assess the performance of an algorithm in terms of its memory operations, we need to take into account multiple characteristics of the cache system: the number of cache layers, the memory and block sizes of each layer, the exact strategy used for cache eviction by each layer, and…
In the previous chapter, we studied computer memory from a theoretical standpoint, using the external memory model to estimate the performance of memory-bound algorithms. While the external memory model is more or less accurate for computations involving HDDs and network storage, where the cost of…
On the data path between the CPU registers and the RAM, there is a hierarchy of caches that exist to speed up access to frequently used data: the layers closer to the processor are faster but also smaller in size. The word “faster” here applies to two closely related but separate timings: The delay…
Even though bandwidth is a more complicated concept, it is much easier to observe and measure than latency: you can simply execute a long series of independent read or write queries, and the scheduler, having access to them in advance, reorders and overlaps them, hiding their latency and maximizing…
The basic units of data transfer in the CPU cache system are not individual bits and bytes, but cache lines. On most architectures, the size of a cache line is 64 bytes, meaning that all memory is divided into blocks of 64 bytes, and whenever you request (read or write) a single byte, you are also…
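One practical consequence, sketched below: since a whole 64-byte line travels at once, data that different threads write should not share a line. The constant 64 is the typical x86 line size, an assumption worth verifying on your machine:

```cpp
#include <cstdint>
#include <cstdio>

// Force each counter onto its own cache line so that two threads
// incrementing different counters don't invalidate each other's lines
// (the "false sharing" problem)
struct alignas(64) padded_counter {
    int64_t value;
};

padded_counter counters[2];

int main() {
    printf("%zu\n", sizeof(padded_counter)); // 64: padded to a full line
}
```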
Starting at some level of the hierarchy, the cache becomes shared between different cores. This reduces the total die area and lets you add more cores on a single chip but also poses some “noisy neighbor” problems as it limits the effective cache size and bandwidth available to a single execution…
Memory requests can overlap in time: while you wait for a read request to complete, you can send a few others, which will be executed concurrently with it. This is the main reason why linear iteration is so much faster than pointer jumping: the CPU knows which memory locations it needs to fetch next…
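A sketch of how one can exploit that overlap explicitly: walking several independent pointer chains at once lets their cache misses be serviced in parallel (it assumes the permutation `q` contains cycles through both starting points):

```cpp
#include <utility>

// Two dependency chains instead of one: the loads on chain `a` and
// chain `b` don't depend on each other, so their misses overlap
std::pair<int, int> chase2(const int *q, int n, int s1, int s2) {
    int a = s1, b = s2;
    for (int i = 0; i < n; i++) {
        a = q[a];
        b = q[b];
    }
    return {a, b};
}
```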
Taking advantage of the free concurrency available in memory hardware, it can be beneficial to prefetch data that is likely to be accessed next if its location can be predicted. This is easy to do when there are no data or control hazards in the pipeline and the CPU can just run ahead of the…
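When the access pattern is data-dependent but still computable in advance, the hint can be issued manually; a sketch using GCC/Clang's `__builtin_prefetch`, where the index array `p` and the distance `D` are hypothetical parameters to tune:

```cpp
const int D = 16; // prefetch distance, machine-dependent

// Sum array elements in the random order given by p; since p itself is
// read sequentially, we can peek D iterations ahead and request the
// future cache line before it is needed
long long sum_gathered(const int *a, const int *p, int n) {
    long long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + D < n)
            __builtin_prefetch(&a[p[i + D]]);
        s += a[p[i]];
    }
    return s;
}
```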
The fact that the memory is partitioned into 64B cache lines makes it difficult to operate on data words that cross a cache line boundary. When you need to retrieve some primitive type, such as a 32-bit integer, you really want to have it located on a single cache line — both because retrieving two…
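This is also why compilers keep primitive types naturally aligned by default, inserting padding into structs; a small illustration:

```cpp
#include <cstdio>

// The compiler inserts 3 bytes of padding after `c` so that `x` starts
// at a 4-byte boundary and can never straddle a cache line
struct padded {
    char c;
    int x;
};

int main() {
    printf("%zu\n", sizeof(padded)); // 8, not 5
}
```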
In the pointer chasing benchmark, for simplicity, we didn’t use actual pointers, but integer indices relative to a base address: The memory addressing operator on x86 is fused with the address computation, so the k = q[k] line folds into just a single terse instruction that also does multiplication…
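For context, a sketch of the loop in question; the comment shows the kind of single instruction it folds into (exact assembly depends on the compiler):

```cpp
// Follow a permutation for n steps; each load address depends on the
// previous load, so nothing can be fetched ahead of time.
// k = q[k] compiles to one mov with base + scaled-index addressing,
// e.g.: mov eax, dword ptr [rdi + rax*4]
int chase(const int *q, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        k = q[k];
    return k;
}
```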
Consider a strided incrementing loop over an array of size $N=2^{21}$ with a fixed step size of 256: And then this one, with the step size of 257: Which one will be faster to finish? There are several considerations that come to mind: At first, you might think that there shouldn’t be much…
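A sketch of the two loops being compared (reconstructed from the description, not the article's exact code):

```cpp
const int N = 1 << 21;
int a[N];

// With a power-of-two stride, all touched addresses map into the same
// few cache sets, so the effective cache size collapses
void step256() {
    for (int i = 0; i < N; i += 256)
        a[i]++;
}

// An odd stride spreads the accesses evenly over all cache sets
void step257() {
    for (int i = 0; i < N; i += 257)
        a[i]++;
}
```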
Consider yet again the strided incrementing loop: We change the stride $D$ and increase the array size proportionally so that the total number of iterations $N$ remains constant. As the total number of memory accesses also remains constant, for all $D \geq 16$, we should be fetching exactly…
It is often beneficial to group together the data you need to fetch at the same time: preferably, on the same or, if that isn’t possible, neighboring cache lines. This improves the spatial locality of your memory accesses, positively impacting the performance of memory-bound algorithms. To…
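A sketch of what this looks like in practice, using hypothetical hot/cold fields: the fields scanned on every query live together, and the rarely touched payload is split off so that it doesn't dilute the cache lines carrying the hot data.

```cpp
const int N = 1000;

// Accessed on every query: several records fit on one cache line
struct hot { float x, y, z; };

// Accessed only on rare full lookups, kept in a separate array
struct cold { char payload[64]; };

hot  hot_records[N];
cold cold_records[N];
```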
When you try to explain a complex concept, it is generally a good idea to give a very simple and minimal example illustrating it. This is why in this book you see about a dozen different ways of calculating the sum of an array, each highlighting a certain CPU feature. But the main purpose of this…
In this section, we will derive a variant of gcd that is ~2x faster than the one in the C++ standard library. #Euclid’s Algorithm Euclid’s algorithm solves the problem of finding the greatest common divisor (GCD) of two integers $a$ and $b$, which is defined as the largest such number $g$…
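The baseline the section starts from, for reference (the ~2x speedup comes later from the binary GCD variant, which replaces the division with shifts):

```cpp
#include <utility>

// Euclid's algorithm: repeatedly replace (a, b) with (b, a mod b)
// until the second number becomes zero
int gcd(int a, int b) {
    while (b) {
        a %= b;
        std::swap(a, b);
    }
    return a;
}
```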
Computing the minimum of an array is easily vectorizable, as it is no different from any other reduction: in AVX2, you just need to use the convenient _mm256_min_epi32 intrinsic as the inner operation. It computes the minimum of two 8-element vectors in one cycle — even faster than in the scalar…
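A sketch of the full reduction built around that intrinsic, assuming AVX2 and an array length divisible by 8:

```cpp
#include <immintrin.h>
#include <climits>

int min_avx2(const int *a, int n) {
    // 8 independent running minima, one per vector lane
    __m256i m = _mm256_set1_epi32(INT_MAX);
    for (int i = 0; i < n; i += 8) {
        __m256i x = _mm256_loadu_si256((const __m256i*) &a[i]);
        m = _mm256_min_epi32(m, x);
    }
    // combine the 8 lane minima into a single scalar result
    int t[8];
    _mm256_storeu_si256((__m256i*) t, m);
    int r = t[0];
    for (int i = 1; i < 8; i++)
        r = t[i] < r ? t[i] : r;
    return r;
}
```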
The prefix sum, also known as cumulative sum, inclusive scan, or simply scan, is a sequence of numbers $b_i$ generated from another sequence $a_i$ using the following rule: $$\begin{aligned} b_0 &= a_0 \\ b_1 &= a_0 + a_1 \\ b_2 &= a_0 + a_1 + a_2 \\ &\ldots \end{aligned}$$…
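The straightforward scalar implementation of this rule runs in one linear pass; a minimal in-place version:

```cpp
// Overwrite a[i] with a[0] + a[1] + ... + a[i]; each element needs
// only the previous prefix sum
void prefix_sum(int *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```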
In this case study, we will design and implement several algorithms for matrix multiplication. We start with the naive “for-for-for” algorithm and incrementally improve it, eventually arriving at a version that is 50 times faster and matches the performance of BLAS libraries while being under 40…
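The starting point, for reference; row-major square matrices and this particular loop order are assumptions of the sketch:

```cpp
// Naive O(n^3) matrix multiplication: c = a * b
void matmul(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float s = 0;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}
```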
Optimizing data structures is different from optimizing algorithms as data structure problems have more dimensions: you may be optimizing for throughput, for latency, for memory usage, or any combination of those — and this complexity blows up exponentially when you need to process multiple query…
While improving the speed of user-facing applications is the end goal of performance engineering, people don’t really get excited over 5-10% improvements in some databases. Yes, this is what software engineers are paid for, but these types of optimizations tend to be too intricate and…
This section is a follow-up to the previous one, where we optimized binary search by means of removing branching and improving the memory layout. Here, we will also be searching in sorted arrays, but this time we are not limited to fetching and comparing only one element at a time. In this…
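The core trick that lifts the one-element-at-a-time restriction, sketched under AVX2 assumptions: compare the search key against 8 sorted keys of a node at once and extract its rank from the comparison mask (the node layout here follows the general idea, not the article's exact code):

```cpp
#include <immintrin.h>

// Count how many of the 8 sorted keys in `node` are less than `key`
int rank8(int key, const int *node) {
    __m256i x = _mm256_set1_epi32(key);
    __m256i y = _mm256_loadu_si256((const __m256i*) node);
    __m256i mask = _mm256_cmpgt_epi32(x, y); // lanes where key > node[i]
    return __builtin_popcount(_mm256_movemask_ps(_mm256_castsi256_ps(mask)));
}
```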
In the previous article, we designed and implemented static B-trees to speed up binary searching in sorted arrays. In its last section, we briefly discussed how to make them dynamic again while retaining the performance gains from SIMD, and validated our predictions by adding and following explicit…
The lessons learned from optimizing binary search can be applied to a broad range of data structures. In this article, instead of trying to optimize something from the STL again, we focus on segment trees, structures that may be unfamiliar to most normal programmers and perhaps even most…
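For readers meeting them for the first time: a minimal classic (not yet optimized) segment tree supporting point updates and range sums, in the compact bottom-up formulation:

```cpp
#include <vector>

struct segtree {
    int n;
    std::vector<long long> t;

    segtree(int n) : n(n), t(2 * n, 0) {}

    // add v to the i-th element: update the leaf and all its ancestors
    void add(int i, long long v) {
        for (i += n; i > 0; i >>= 1)
            t[i] += v;
    }

    // sum over the half-open interval [l, r)
    long long sum(int l, int r) {
        long long s = 0;
        for (l += n, r += n; l < r; l >>= 1, r >>= 1) {
            if (l & 1) s += t[l++];
            if (r & 1) s += t[--r];
        }
        return s;
    }
};
```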
Notes on Myself: The book starts with Keith talking about himself: how he felt creative and inspired as a child, and how he was disappointed to see everything become colorless and dull as he got older. This childlike state he calls "the visionary world".…